Pub Date: 2025-08-19 | DOI: 10.1016/j.bdr.2025.100555
Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani
Recently, Control Flow Graphs and Function Call Graphs have gained attention in malware detection tasks due to their ability to represent the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representations pose a significant challenge for researchers. At the same time, the simple binary classification provided by GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability into a malware detection framework to enhance both efficiency and interpretability. Through extensive evaluation, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs while maintaining detection performance. Furthermore, the important subgraphs extracted using GNNExplainer provide better insight into the model's decisions and help security experts with their further analysis.
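The abstract does not specify the reduction technique itself. As a generic, hedged illustration of the idea (shrinking a control-flow or call graph while preserving connectivity), the sketch below collapses chains of nodes that have exactly one predecessor and one successor; the function name and graph encoding are my own, not the paper's:

```python
def collapse_chains(adj):
    """Collapse chain nodes (exactly one predecessor and one successor)
    in a directed graph given as {node: set_of_successors}; every node
    is assumed to appear as a key. Illustrative only, not the paper's
    reduction algorithm."""
    adj = {n: set(s) for n, s in adj.items()}
    changed = True
    while changed:
        changed = False
        preds = {n: set() for n in adj}
        for u in adj:
            for v in adj[u]:
                preds[v].add(u)
        for n in list(adj):
            if len(adj[n]) == 1 and len(preds[n]) == 1:
                (s,) = adj[n]
                (p,) = preds[n]
                if s != n and p != n:
                    # rewire the predecessor past n and drop n
                    adj[p].discard(n)
                    adj[p].add(s)
                    del adj[n]
                    changed = True
                    break
    return adj
```

On a straight-line chain a to b to c to d, this keeps only the entry and exit nodes, which is the kind of size reduction the abstract reports while edge reachability is preserved.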
Title: Explainable malware detection through integrated graph reduction and learning techniques. Published in Big Data Research, vol. 41, Article 100555.
Pub Date: 2025-08-18 | DOI: 10.1016/j.bdr.2025.100558
Yong Li , Jingpeng Wu , Zhongying Zhang
Link prediction is a paradigmatic problem in network science with tremendous real-world applications; it aims to infer missing or future links from the currently observed nodes and links. However, conventional link prediction models are based on network structure alone, with relatively low prediction accuracy and limited universality and scalability. The performance of link prediction based on machine learning and hand-crafted features is strongly influenced by subjective feature choices. Although graph embedding learning (GEL) models can avoid these shortcomings, they still pose some challenges: because GEL models are generally based on random walks and graph neural networks (GNNs), their prediction accuracy is relatively low, making them unsuitable for revealing hidden information in node-featureless networks. To address these challenges, we present NGLinker, a new link prediction model based on Node2vec and GraphSage that reconciles performance and accuracy in node-featureless networks. Rather than learning node features with label information, NGLinker depends only on the local network structure. Quantitatively, we observe superior prediction accuracy for NGLinker compared to state-of-the-art models on three public networks and one private network, supporting the feasibility and effectiveness of the approach. NGLinker not only achieves strong prediction accuracy in terms of precision and area under the receiver operating characteristic curve (AUC) but also offers good universality and scalability. The NGLinker model extends the application of GNNs to node-featureless networks.
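NGLinker's exact architecture is not given in the abstract. As a hedged sketch of its first stage only, the snippet below generates uniform random walks (the p = q = 1 special case of Node2vec's biased walks) that would feed an embedding model such as GraphSage; parameter names are illustrative, not NGLinker's interface:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Uniform random walks from every node of a graph given as
    {node: set_of_neighbors}. This is the unbiased special case of
    Node2vec's walk generator, shown for illustration only."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(sorted(nbrs)))
            walks.append(walk)
    return walks

# a toy path graph a - b - c for demonstration
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
demo = random_walks(toy)
```

Each walk only ever traverses existing edges, which is why walk-based embeddings can work without any node features, as the abstract emphasizes.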
Title: NGLinker: Link prediction for node featureless networks. Published in Big Data Research, vol. 41, Article 100558.
Pub Date: 2025-08-13 | DOI: 10.1016/j.bdr.2025.100557
Luping Zhi , Wanmin Wang
Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.
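The abstract reports MCC alongside AUC and AUPRC; under the severe class imbalance it describes, MCC is a useful single-number summary. A minimal reference computation of the metric (not the authors' code):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts,
    with the conventional 0.0 fallback when any marginal is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which is why it is a meaningful headline number for fraud detection.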
Title: Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN. Published in Big Data Research, vol. 41, Article 100557.
Pub Date: 2025-06-09 | DOI: 10.1016/j.bdr.2025.100553
Zheng Fang , Toby Cai
Modeling stock returns has often relied on multivariate time series analysis, and constructing an accurate model remains a challenging goal for both market investors and academic researchers. Stock return prediction typically involves multiple variables and a combination of long-term and short-term time series patterns. In this paper, we propose a new deep learning network, named DLS-TS-Net, to model stock returns and address this challenge. We apply DLS-TS-Net to multivariate time series forecasting. The network integrates a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) units, and Gated Recurrent Units (GRUs). DLS-TS-Net overcomes LSTM's insensitivity to linear components in stock market forecasting by incorporating a traditional autoregressive model. Experimental results demonstrate that DLS-TS-Net excels at capturing long-term trends in multivariate factors and short-term fluctuations in the stock market, outperforming traditional time series and machine learning models. Additionally, when combined with the investment strategies proposed in this paper, DLS-TS-Net shows superior performance in managing risk during extreme events.
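The abstract attributes part of the gain to a traditional autoregressive component that captures the linear part of the signal. As a toy illustration of that idea (not the paper's layer), an AR(1) coefficient can be fit by ordinary least squares on lagged pairs:

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_(t-1) + noise,
    fitted without an intercept on consecutive pairs."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(x * x for x in series[:-1])
    return num / den

def predict_next(series, phi):
    """One-step-ahead linear forecast from the last observation."""
    return phi * series[-1]
```

In hybrid models of this kind, such a linear forecast is typically added to the neural network's output so the network only has to learn the nonlinear residual.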
Title: Deep neural network modeling for financial time series analysis. Published in Big Data Research, vol. 41, Article 100553.
Pub Date: 2025-06-08 | DOI: 10.1016/j.bdr.2025.100552
Jiachen Ma , Nazmus Sakib , Fahim Islam Anik , Sheikh Iqbal Ahamed
While temporal sentiment labels prove invaluable for video tagging, segmentation, and labeling tasks in multimedia studies, large-scale manual annotation remains cost- and time-prohibitive. Emerging Online Time-Sync Comment (TSC) datasets offer a promising alternative for generating sentiment maps. However, the limited scope of existing TSC datasets and the lack of guidelines for resource-constrained data creation hinder broader use. This study addresses these challenges by proposing a novel system for automated TSC generation that utilizes recent YouTube comments as a readily accessible source of time-synchronized data. The efficacy of our multi-platform data mining system is evaluated through extensive long-term trials, leading to the development and analysis of two large-scale TSC datasets. Benchmarking against original temporal Automatic Speech Recognition (ASR) sentiment annotations validates the accuracy of our generated data.
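A sentiment map of the kind described pairs each time-synced comment with a video timestamp and aggregates per time window. A minimal sketch of that aggregation step; the (timestamp, polarity) tuple layout is an assumed schema, not the authors' data format:

```python
def sentiment_map(comments, bin_seconds=30):
    """Aggregate (timestamp_seconds, polarity) pairs into a mean
    polarity per time bin, a coarse per-video sentiment timeline.
    The input schema is an assumption for illustration."""
    bins = {}
    for ts, polarity in comments:
        key = int(ts // bin_seconds)
        bins.setdefault(key, []).append(polarity)
    return {k: sum(v) / len(v) for k, v in sorted(bins.items())}
```

The resulting per-bin timeline is what would be benchmarked against the temporal ASR sentiment annotations mentioned above.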
Title: Time-synchronized sentiment labeling via autonomous online comments data mining: A multimodal information fusion on large-scale multimedia data. Published in Big Data Research, vol. 41, Article 100552.
Pub Date: 2025-06-07 | DOI: 10.1016/j.bdr.2025.100550
Samuele Cesarini, Fabrizio Antolini, Ivan Terraglia
This paper presents the development of an integrated data system tailored to the Italian regions, combining microdata from the Bank of Italy's and ISTAT's surveys. These datasets support an in-depth analysis of both domestic and international aspects of tourism, framed within the theoretical context of tourism determinants. By merging this integrated dataset with additional data from other statistical sources, the study offers a queryable relational database enabling granular regional analysis. Currently, tourism statistics in Italy are fragmented and do not provide a unified picture of tourism in its many aspects. The relational model's interoperability addresses Italy's fragmented tourism data landscape, and its data definition language represents an important step towards the creation of a unified tourism archive. Microdata allow for statistical analyses different from those usually carried out with aggregated data, deepening knowledge of the sector's dynamics.
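The system's actual schema is not given in the abstract. A toy sqlite3 sketch of the relational idea, with region-keyed tables standing in for the two sources joined in one granular query; every table name, column, and figure below is invented:

```python
import sqlite3

# Hypothetical stand-ins for the ISTAT and Bank of Italy sources.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE istat_arrivals (region TEXT, year INT, arrivals INT);
    CREATE TABLE bankitalia_spend (region TEXT, year INT, spend_eur INT);
""")
con.executemany("INSERT INTO istat_arrivals VALUES (?, ?, ?)",
                [("Lazio", 2023, 1000), ("Veneto", 2023, 1500)])
con.executemany("INSERT INTO bankitalia_spend VALUES (?, ?, ?)",
                [("Lazio", 2023, 90000), ("Veneto", 2023, 120000)])

# A regional indicator spanning both sources in a single query.
rows = con.execute("""
    SELECT a.region,
           a.arrivals,
           s.spend_eur * 1.0 / a.arrivals AS spend_per_arrival
    FROM istat_arrivals a
    JOIN bankitalia_spend s ON a.region = s.region AND a.year = s.year
    ORDER BY a.region
""").fetchall()
```

The join on (region, year) is the interoperability the abstract describes: once the sources share keys, cross-source regional indicators fall out of ordinary SQL.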
Title: Development of an integrated data system for regional tourism analysis in Italy: A microdata perspective. Published in Big Data Research, vol. 41, Article 100550.
Pub Date: 2025-06-06 | DOI: 10.1016/j.bdr.2025.100551
Yang Liu , Xiaotang Zhou , Zhenwei Zhang , Xiran Yang
The application of topic models and pre-trained BERT is becoming increasingly widespread in Natural Language Processing (NLP), but there is no standard method for combining them. In this paper, we propose a new pre-trained BERT-guided Embedding-based Topic Model (BETM). Through constraints on the topic-word and document-topic distributions, BETM can ingeniously learn semantic, syntactic, and topic information from BERT embeddings. In addition, we design two solutions to mitigate the insufficient contextual information caused by short inputs and the semantic truncation caused by long inputs in BETM. We find that the word embeddings of BETM are more suitable for topic modeling than pre-trained GloVe word embeddings, and that BETM can flexibly select different variants of pre-trained BERT for specific datasets to obtain better topic quality. We also find that BETM handles large and heavy-tailed vocabularies well, even when they contain stop words. BETM achieves state-of-the-art (SOTA) results on several benchmark datasets: Yelp Review Polarity (106,586 samples), Wiki Text 103 (71,533 samples), Open-Web-Text (35,713 samples), 20Newsgroups (10,899 samples), and AG-news (127,588 samples).
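Embedding-based topic models of this family typically define the topic-word distribution as a softmax over inner products between a topic embedding and each word embedding. A minimal sketch of that core step, with toy dense lists standing in for BERT embeddings (this is the generic ETM-style construction, not BETM's specific constraints):

```python
import math

def topic_word_dist(topic_vec, word_vecs):
    """Topic-word distribution beta_k: softmax over inner products
    between one topic embedding and each word embedding in
    word_vecs ({word: vector}). Toy vectors, not BERT embeddings."""
    scores = [sum(t * w for t, w in zip(topic_vec, vec))
              for vec in word_vecs.values()]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {word: e / z for word, e in zip(word_vecs, exps)}
```

Because the distribution is parameterized by embeddings rather than a free probability table, words with similar vectors receive similar topic mass, which is how such models cope with large, heavy-tailed vocabularies.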
Title: BETM: A new pre-trained BERT-guided embedding-based topic model. Published in Big Data Research, vol. 41, Article 100551.
Pub Date: 2025-05-23 | DOI: 10.1016/j.bdr.2025.100537
Alessandro Magrini
The development of models for bankruptcy risk prediction has gained much attention in recent years due to the wide availability of financial statement data. Most existing predictive models rely on financial ratios, which are performance-based measures expressing the relative magnitude of two accounting items. Despite the popularity of financial ratios, their use is notoriously accompanied by serious practical drawbacks, like the occurrence of outliers and redundancy, making data preprocessing necessary to avoid computational problems and obtain good predictive accuracy. Isometric log ratios can potentially overcome these problems because they are designed to represent compositional data efficiently and have a logarithmic form that limits the occurrence of outliers. However, although they are not novel in the analysis of financial statements, no study has ever employed them to predict bankruptcy. In this article, we show the effectiveness of isometric log ratios in detecting bankruptcy events in a sample of 138,720 Italian firms (127,420 active and 11,300 bankrupted) belonging to different industries and of different sizes and ages. For this purpose, we use logistic regression with adaptive LASSO regularization and random forests to construct several predictive models featuring either financial ratios or isometric log ratios, and combining different horizons and lag structures. The results show that a set of 8 isometric log ratios provides, without preprocessing, almost the same predictive accuracy as a selection of 16 financial ratios that requires dropping 3.6% of the data. Also, the adaptive LASSO regularization reveals that redundancy for isometric log ratios is always below 20%, and in some cases near 0%, while it ranges from 12.5% to 46.9% for financial ratios.
The predictive accuracy of models based on logistic regression is in line with and even higher than the one reported by recent studies, and random forests achieve a gain in the area under the Receiver Operating Characteristic (ROC) curve ranging between two and three percentage points.
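For a positive D-part composition, one standard isometric log-ratio construction (the pivot basis) compares each part with the geometric mean of the remaining parts. A minimal implementation of that construction; the article's choice of ilr basis may differ:

```python
import math

def ilr_pivot(x):
    """Pivot isometric log-ratio coordinates of a positive composition:
    z_i = sqrt((D-i)/(D-i+1)) * ln(x_i / gmean(x[i+1:])), i = 1..D-1.
    One standard ilr basis, shown for illustration."""
    d = len(x)
    z = []
    for i in range(d - 1):
        rest = x[i + 1:]
        gmean = math.exp(sum(math.log(v) for v in rest) / len(rest))
        z.append(math.sqrt((d - i - 1) / (d - i)) * math.log(x[i] / gmean))
    return z
```

Two properties motivate their use here: the coordinates are scale-invariant (rescaling all accounting items leaves them unchanged), and the logarithm compresses the extreme values that plague raw financial ratios.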
Title: Bankruptcy risk prediction: A new approach based on compositional analysis of financial statements. Published in Big Data Research, vol. 41, Article 100537.
Pub Date: 2025-05-23 | DOI: 10.1016/j.bdr.2025.100543
Chiara Ardito , Roberto Leombruni , Giuseppe Costa , Angelo d’Errico
The relationship between age at retirement and subsequent physical health remains contradictory in the literature, with more recent studies suggesting possible adverse health effects linked to employment at later ages. The aim of this study was to assess the long-term risk of overall mortality and incidence of cardiovascular diseases (CVDs) associated with age at retirement in three large Italian cohorts, using both survey and administrative data.
The risk of mortality and CVDs associated with age at retirement, treated as continuous, was assessed separately by gender using age-adjusted Cox models, further controlled for chronic morbidity, education, and socioeconomic and previous working characteristics. In a second analysis, age at retirement was examined as a dichotomous variable: for each cutoff from 52 to 65 years, the incidence of the health outcomes among subjects who retired after that age was compared with the incidence among those who retired at or before it.
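The dichotomized comparison just described can be sketched as a loop over cutoff ages. The crude event-rate ratio below is a toy stand-in; the study itself used age-adjusted Cox models with the covariates listed above:

```python
def rate_ratio_by_cutoff(records, cutoffs):
    """Crude event-rate ratio, at each cutoff age, of subjects retiring
    after the cutoff versus those retiring at or before it.
    records: (retirement_age, had_event) pairs. Toy illustration only,
    not the study's survival analysis."""
    out = {}
    for c in cutoffs:
        late = [e for age, e in records if age > c]
        early = [e for age, e in records if age <= c]
        if late and early and sum(early) > 0:
            out[c] = (sum(late) / len(late)) / (sum(early) / len(early))
    return out
```

A ratio above 1 at a cutoff would correspond to the pattern reported below: higher incidence among those who retired later.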
Higher age at retirement was associated with significantly higher mortality among men in all three cohorts, while among women the association was not significant, although it pointed in the same direction as for men. The risk of CVDs was also significantly associated with higher age at retirement among men in all the datasets, and among women in two of them. The analyses with dichotomized age at retirement confirmed the results based on continuous age at retirement for both genders. Several robustness analyses, including an instrumental-variable (IV) Poisson regression, confirmed the validity of the results for men, whereas the results for women were less stable and robust.
Policy makers should be aware of the risk to public health of policies that increase retirement age.
Title: "Mortality and risk of cardiovascular diseases by age at retirement in three Italian cohorts"
Pub Date: 2025-05-19 | DOI: 10.1016/j.bdr.2025.100540
Xiaoyu Zhang , Ye Pan , Lilan Tu
This research utilizes daily closing exchange rate data from 2010 to 2023 for countries participating in the Belt and Road Initiative (BRI), as well as China’s import and export volumes with these countries. Taking the renminbi (RMB) as the base currency and the other BRI currencies as quote currencies, we employ the Autoregressive Distributed Lag (ARDL) model to propose an algorithm for constructing a temporal two-layer network, resulting in an exchange-rate-trade network composed of 14 subnetworks. An analysis of the network’s topological structure shows that 2013 marks a significant turning point, after which the network transitions from a decentralized to a more centralized form. To assess the annual impact of China’s exchange rate and trade from 2010 to 2023, we introduce a comprehensive index for identifying key nodes within the network. Our findings based on this index indicate that: (1) Lebanon, Kyrgyzstan, and other diverse countries and regions emerge as key nodes, demonstrating China’s close economic ties with these countries and reflecting the substantial influence of RMB internationalization; and (2) compared with other years, China’s exchange rate market exerts a notably stronger influence on the trade market in 2018, 2021, 2022, and 2023.
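The abstract does not specify how the comprehensive key-node index is built, so the following is only a hedged sketch of one plausible construction: score each country by a weighted combination of its normalized degree in the exchange-rate layer and in the trade layer of a two-layer network. The country codes, layer data, and the weight `alpha` are all invented for illustration.

```python
# Illustrative sketch (not the paper's actual index): combine a node's
# normalized degree across the two layers of an exchange-rate-trade network.
# Each layer is a plain adjacency mapping: node -> set of neighbor nodes.

def key_node_index(layer_fx, layer_trade, alpha=0.5):
    """Return {node: alpha * deg_fx + (1 - alpha) * deg_trade}, with
    degrees normalized by (n - 1), as in standard degree centrality."""
    nodes = set(layer_fx) | set(layer_trade)
    n = len(nodes)
    scores = {}
    for v in nodes:
        deg_fx = len(layer_fx.get(v, ())) / (n - 1)
        deg_tr = len(layer_trade.get(v, ())) / (n - 1)
        scores[v] = alpha * deg_fx + (1 - alpha) * deg_tr
    return scores

# Toy two-layer network over three hypothetical countries
fx = {"CN": {"LB", "KG"}, "LB": {"CN"}, "KG": {"CN"}}
trade = {"CN": {"LB"}, "LB": {"CN"}, "KG": set()}
scores = key_node_index(fx, trade)
top = max(scores, key=scores.get)
print(top)  # the node with the highest combined degree across both layers
```

A per-year version of such an index, computed on each of the 14 temporal subnetworks, would yield the kind of year-by-year key-node rankings the abstract reports.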
Title: "The influence of China's exchange rate market on the Belt and Road trade market: Based on temporal two-layer networks"