Pub Date: 2025-08-28 | Epub Date: 2025-06-09 | DOI: 10.1016/j.bdr.2025.100553
Zheng Fang , Toby Cai
Modeling stock returns has often relied on multivariate time series analysis, and constructing an accurate model remains a challenging goal for both market investors and academic researchers. Stock return prediction typically involves multiple variables and a combination of long-term and short-term time series patterns. In this paper, we propose a new deep learning network, named DLS-TS-Net, to model stock returns and address this challenge. We apply DLS-TS-Net to multivariate time series forecasting. The network integrates a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) units, and Gated Recurrent Units (GRUs). DLS-TS-Net overcomes LSTM's insensitivity to linear components in stock market forecasting by incorporating a traditional autoregressive model. Experimental results demonstrate that DLS-TS-Net excels at capturing long-term trends in multivariate factors and short-term fluctuations in the stock market, outperforming traditional time series and machine learning models. Additionally, when combined with the investment strategies proposed in this paper, DLS-TS-Net shows superior performance in managing risk during extreme events.
Title: Deep neural network modeling for financial time series analysis (Big Data Research, vol. 41, Article 100553)
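The abstract does not reproduce DLS-TS-Net's architecture. As an illustrative sketch only (not the authors' code), the core idea of adding a traditional autoregressive component to a nonlinear forecaster's output can be expressed in a few lines; here the nonlinear (CNN/LSTM/GRU) prediction is a stubbed placeholder:

```python
import numpy as np

def ar_component(series, p):
    """One-step-ahead forecast from an order-p autoregressive model
    fitted by ordinary least squares."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(series[-p:] @ coef)

def combined_forecast(series, nonlinear_pred, p=3):
    """Hybrid forecast: a nonlinear part (standing in for the
    CNN/LSTM/GRU output) plus the traditional AR linear part."""
    return nonlinear_pred + ar_component(series, p)

# On a noiseless linear trend, the AR part alone recovers the next value.
series = np.arange(50, dtype=float)
print(round(combined_forecast(series, nonlinear_pred=0.0), 2))  # -> 50.0
```

The AR term supplies exactly the linear sensitivity that, per the abstract, pure LSTM models lack.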
Pub Date: 2025-08-28 | Epub Date: 2025-05-15 | DOI: 10.1016/j.bdr.2025.100539
Cristian Usala, Isabella Sulis, Mariano Porcu
This study investigates the determinants of tertiary education success in Italy, focusing on students' outcomes between the first and second years. We use population data on students enrolled between 2015 and 2019, integrating information on high school environments and degree program characteristics. We exploit this rich dataset with a two-step approach: the first step defines indicators of high school quality and degree program difficulty; the second estimates a multinomial logit to assess the determinants of students' probability of being classified as regular, churner, at risk of dropout, or dropout. The 2019 cohort is examined further, exploiting additional information on students' socioeconomic backgrounds and schools' self-assessed effectiveness evaluations. Results indicate that students' high school backgrounds, socioeconomic conditions, and post-graduation prospects (the net wages and employment rates of graduates in the chosen degree program) significantly influence academic success and persistence. Overall, the results offer a comprehensive view of the determinants of university success, with specific patterns observed across the different student categories.
Title: Exploring the impact of high schools, socioeconomic factors, and degree programs on higher education success in Italy (Big Data Research, vol. 41, Article 100539)
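As a hedged illustration (not the authors' estimates), the second-step multinomial logit maps student covariates to probabilities over the four outcome categories via a softmax; the coefficients below are invented purely for the example:

```python
import math

def multinomial_logit_probs(x, betas):
    """Class probabilities from a multinomial logit:
    P(k | x) = exp(x . beta_k) / sum_j exp(x . beta_j)."""
    scores = [sum(xi * bi for xi, bi in zip(x, b)) for b in betas]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Outcomes: regular, churner, at risk of dropout, dropout.
# Coefficients are illustrative only, not the paper's estimates.
betas = [[0.8, 0.5], [0.1, 0.2], [-0.2, 0.1], [-0.7, -0.8]]
probs = multinomial_logit_probs([1.0, 1.0], betas)
print(round(sum(probs), 6), probs.index(max(probs)))  # -> 1.0 0
```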
Pub Date: 2025-08-28 | Epub Date: 2025-05-23 | DOI: 10.1016/j.bdr.2025.100543
Chiara Ardito , Roberto Leombruni , Giuseppe Costa , Angelo d’Errico
The relationship between age at retirement and subsequent physical health remains contradictory in the literature, with more recent studies suggesting possible adverse health effects of employment at later ages. The aim of this study was to assess the long-term risk of overall mortality and incidence of cardiovascular diseases (CVDs) associated with age at retirement in three large Italian cohorts, using both survey and administrative data.
The risk of mortality and CVDs associated with age at retirement, treated as continuous, was assessed separately by gender using age-adjusted Cox models, further adjusted for chronic morbidity, education, and socioeconomic and prior employment characteristics. In a second set of analyses, age at retirement was dichotomized: for each cutoff from 52 to 65 years, the incidence of the health outcomes among subjects who retired after that age was compared with the incidence among those who retired at or before it.
Higher age at retirement was associated with significantly higher mortality among men in all three cohorts, while among women the association was in the same direction but not significant. The risk of CVDs was also significantly associated with higher age at retirement among men in all the datasets, and among women in two of them. The dichotomized analyses confirmed the results based on continuous age at retirement for both genders. Several robustness analyses, including an instrumental-variable (IV) Poisson model, confirmed the validity of the results for men, whereas the results for women were less stable.
Policy makers should be aware of the public health risks of policies that increase the retirement age.
Title: Mortality and risk of cardiovascular diseases by age at retirement in three Italian cohorts (Big Data Research, vol. 41, Article 100543)
Pub Date: 2025-08-28 | Epub Date: 2025-05-16 | DOI: 10.1016/j.bdr.2025.100533
Alessio Bumbea , Andrea Mazzitelli , Giuseppe Espa , Alessandro Rinaldi
Innovative startups are a key source of innovation and technological development; understanding their behavior can therefore help clarify the direction in which business organization is evolving. This paper introduces a new method for clustering innovative startups that combines bipartite graph partitioning with spatial bootstrapping, improving the accuracy and interpretability of the clusters. Recent advancements in clustering have introduced ensemble, or consensus, clustering methods, which aim to merge multiple clustering results into a superior outcome. A key challenge in this field is effectively integrating diverse clusterings, and one promising solution relies on graph formalism and partitioning strategies. By leveraging advanced graph partitioning techniques, we transform the task of partitioning the ensemble graph into a community detection problem. Our approach improves the traditional bipartite-graph method used in cluster ensembles by implementing the state-of-the-art biLouvain algorithm. We also focus on techniques that increase the interpretability of the clusters and show how they can be used to extract insightful information from the data. The proposed methodology was applied to a dataset of technologically advanced new businesses located in the Lombardy region and recorded as innovative startups in the special section of the Italian Chambers of Commerce's Business Register.
Title: Bipartite graph partitioning and spatial bootstrapping methods: A case study of innovative startups (Big Data Research, vol. 41, Article 100533)
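The authors' code is not included in the abstract. As a sketch under stated assumptions, the bipartite object-cluster graph that a cluster ensemble feeds to a community-detection algorithm such as biLouvain can be built like this (the partition labels are invented for illustration):

```python
from collections import defaultdict

def build_ensemble_bipartite(partitions):
    """Object-cluster bipartite graph for a cluster ensemble:
    one node per object, one node per (run, cluster-id) pair,
    and an edge whenever an object falls in that cluster."""
    edges = defaultdict(set)
    for run, labels in enumerate(partitions):
        for obj, cluster in enumerate(labels):
            edges[obj].add((run, cluster))
    return dict(edges)

# Two base clusterings of four objects (toy labels).
partitions = [[0, 0, 1, 1],
              [0, 1, 1, 1]]
graph = build_ensemble_bipartite(partitions)
print(sorted(graph[0]))  # object 0 links to cluster 0 of both runs
```

Running community detection over this graph then yields the consensus partition: objects that co-occur in many cluster nodes tend to land in the same community.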
In an era where digital technologies such as AI, cloud computing and IoT are reshaping global business dynamics, the digital transformation of enterprises has become a pivotal factor for maintaining competitive advantage. This paper provides an in-depth analysis of the digitalization process among Italian firms, leveraging data from the ISTAT ICT survey. Using a fuzzy set approach, we develop a refined index to measure technological deprivation across multiple dimensions, providing a detailed understanding of how digitalization is adopted at the firm level. The results indicate a moderate level of technological development among firms. The dimension related to online sales emerges as the most underdeveloped, highlighting it as a critical area for improvement for Italian companies and underscoring the need for targeted policy interventions to bridge these digital gaps. Moreover, the analysis reveals significant disparities across sectors, geographic areas, and firm sizes, with smaller enterprises and those in certain regions exhibiting lower levels of digital adoption. Our study underscores the utility of the fuzzy set methodology for analyzing high-dimensional big data and provides actionable insights for enhancing digital adoption among firms in Italy.
Title: Business digitalization in Italy: A comprehensive analysis using supplementary fuzzy set approach, by Ilaria Benedetti, Federico Crescenzi, Tiziana Laureti, Niccolò Salvini (Big Data Research, vol. 41, Article 100538; DOI: 10.1016/j.bdr.2025.100538; pub date 2025-08-28)
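The paper's exact index construction is not given in the abstract. A minimal sketch of a common fuzzy-set composite (Cerioli-Zani style weighting, which may differ from the authors' formulation; the firm data are toy values) is:

```python
import math

def fuzzy_deprivation_index(memberships):
    """Composite fuzzy deprivation score per firm.
    memberships[i][j] in [0, 1] is firm i's deprivation on dimension j.
    Dimensions get weight w_j = ln(1 / mean_j), so rarer deprivations
    weigh more (Cerioli-Zani style weighting)."""
    n, k = len(memberships), len(memberships[0])
    means = [sum(row[j] for row in memberships) / n for j in range(k)]
    weights = [math.log(1 / m) if 0 < m < 1 else 0.0 for m in means]
    total = sum(weights)
    return [sum(w * row[j] for j, w in enumerate(weights)) / total
            for row in memberships]

# Four firms, two dimensions (e.g. cloud adoption, online sales); toy data.
firms = [[1, 0], [1, 1], [0, 1], [0, 0]]
print(fuzzy_deprivation_index(firms))  # -> [0.5, 1.0, 0.5, 0.0]
```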
Pub Date: 2025-08-28 | Epub Date: 2025-08-20 | DOI: 10.1016/j.bdr.2025.100554
Keren Li , Wenqiang Zhang , Dandan Xiao , Peng Hou , Shuai Yan , Yang Wang , Xuerui Mao
To address the storage challenges stemming from large volumes of heterogeneous data in wind farms, we propose a data compression technique based on tensor train decomposition (TTD). Initially, we establish a tensor-based processing model to standardize the heterogeneous data originating from wind farms, which includes both structured SCADA (supervisory control and data acquisition) data and unstructured video and picture data. Subsequently, we introduce a TTD-based method designed to compress the heterogeneous data generated in wind farms while preserving the inherent spatial eigenstructure of the data. Finally, we validate the efficacy of the proposed method in alleviating data storage challenges by utilizing authentic wind farm datasets. Comparative analysis reveals that the TTD-based method outperforms previously proposed compression techniques, specifically the canonical polyadic (CP) and Tucker methods.
Title: Compression of big data collected in wind farm based on tensor train decomposition (Big Data Research, vol. 41, Article 100554)
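The standard TT-SVD scheme underlying tensor train decomposition can be sketched as follows; this is a generic illustration, not the authors' implementation, and the rank-1 test tensor is invented for the example:

```python
import numpy as np

def tt_decompose(tensor, eps=1e-10):
    """Tensor-train decomposition via sequential truncated SVDs
    (the standard TT-SVD scheme)."""
    shape = tensor.shape
    d = len(shape)
    cores, rank = [], 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        keep = max(1, int(np.sum(S > eps * S[0])))  # truncate tiny modes
        cores.append(U[:, :keep].reshape(rank, shape[k], keep))
        rank = keep
        mat = (S[:keep, None] * Vt[:keep]).reshape(rank * shape[k + 1], -1)
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(0)
# A rank-1 (outer-product) tensor compresses to TT-ranks of 1.
t = np.einsum('i,j,k->ijk', rng.random(4), rng.random(5), rng.random(6))
cores = tt_decompose(t)
print([c.shape for c in cores], np.allclose(tt_reconstruct(cores), t))
```

Here 120 entries are stored in 4 + 5 + 6 = 15 core entries, which is the storage saving the paper exploits on wind farm data.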
Pub Date: 2025-08-28 | Epub Date: 2025-06-08 | DOI: 10.1016/j.bdr.2025.100552
Jiachen Ma , Nazmus Sakib , Fahim Islam Anik , Sheikh Iqbal Ahamed
While temporal sentiment labels prove invaluable for video tagging, segmentation, and labeling tasks in multimedia studies, large-scale manual annotation remains cost- and time-prohibitive. Emerging online Time-Sync Comment (TSC) datasets offer a promising alternative for generating sentiment maps. However, the limited scope of existing TSC datasets and the lack of guidelines for resource-constrained data creation hinder broader use. This study addresses these challenges by proposing a novel system for automated TSC generation that utilizes recent YouTube comments as a readily accessible source of time-synchronized data. The efficacy of our multi-platform data mining system is evaluated through extensive long-term trials, leading to the development and analysis of two large-scale TSC datasets. Benchmarking against original temporal Automatic Speech Recognition (ASR) sentiment annotations validates the accuracy of the generated data. This work establishes a promising method for automatic TSC generation, laying the groundwork for further advancements in multimedia research and paving the way for novel sentiment analysis applications.
Title: Time-synchronized sentiment labeling via autonomous online comments data mining: A multimodal information fusion on large-scale multimedia data (Big Data Research, vol. 41, Article 100552)
Pub Date: 2025-08-28 | Epub Date: 2025-05-17 | DOI: 10.1016/j.bdr.2025.100535
Yunting Liu, Yirong Huang
Unimodal sentiment analysis often fails to capture the complexity of financial sentiment. This paper proposes a multimodal deep learning framework that integrates text, audio, and image data from CCTV news videos on TikTok to construct a multimodal sentiment indicator for the Chinese stock market. Empirical results show that multimodal fusion enhances sentiment analysis, with text outperforming audio and image modalities. The indicator correlates weakly with stock returns but significantly with market volatility, aligns with seasonal sentiment patterns, and reflects significant events like COVID-19. Additionally, weekly sentiment trends indicate the lowest sentiment on Thursdays and the highest on Fridays. This study advances financial sentiment analysis by demonstrating the efficacy of multimodal indicators in capturing market sentiment and informing volatility forecasts.
Title: A multimodal deep learning framework for constructing a market sentiment index from stock news (Big Data Research, vol. 41, Article 100535)
Pub Date: 2025-05-28 | Epub Date: 2025-03-07 | DOI: 10.1016/j.bdr.2025.100519
Salheddine Kabou , Laid Gasmi , Abdelbaset Kabou , Sidi Mohammed Benslimane
One of the critical challenges in big data analytics is protecting individual privacy. Data anonymization models such as k-anonymity and l-diversity are used to balance privacy and data utility when publishing data. However, these models address only a single release of a dataset and provide a fixed level of privacy. In practical big data applications, publishing is more complicated: data is released continuously as new data is collected, and privacy must be preserved across releases. In this research, we propose a new distributed bottom-up approach on Apache Spark that achieves the m-invariance privacy model in the continuous big data context. The proposed approach, the first to address dynamic big data publishing, is based on an insertion process and a split process. In the first, data records collected from different workers are inserted into an improved bottom-up R-tree generalization so as to minimize information loss. The second splits any overflowed node in accordance with the m-invariance requirement, minimizing the overlap between the resulting partitions. The experimental results show significant improvements in data utility, execution time, and counterfeit data records compared to existing techniques in the literature.
Title: ImDMI: Improved Distributed M-Invariance model to achieve privacy continuous big data publishing using Apache Spark (Big Data Research, vol. 40, Article 100519)
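The m-invariance condition itself — each QI-group contains at least m distinct sensitive values, and each individual's group signature stays fixed across releases — can be checked with a small sketch; this is illustrative only, not the ImDMI Spark implementation, and the release/group names are invented:

```python
def is_m_invariant(releases, m):
    """Check m-invariance across serial releases. Each release maps
    person_id -> (group_id, sensitive_value). Requires every group's
    signature (set of sensitive values) to have >= m distinct values,
    and each person's signature to be identical in every release."""
    signatures = {}  # person_id -> frozenset of sensitive values
    for release in releases:
        groups = {}
        for pid, (gid, sv) in release.items():
            groups.setdefault(gid, set()).add(sv)
        for pid, (gid, sv) in release.items():
            sig = frozenset(groups[gid])
            if len(sig) < m:
                return False                     # group too small
            if signatures.setdefault(pid, sig) != sig:
                return False                     # signature changed
    return True

r1 = {1: ('g1', 'flu'), 2: ('g1', 'hiv')}
r2 = {1: ('a', 'flu'), 2: ('a', 'hiv'), 3: ('b', 'cold'), 4: ('b', 'hiv')}
print(is_m_invariant([r1, r2], 2))                        # -> True
print(is_m_invariant([r1, {1: ('a', 'flu'), 2: ('b', 'hiv')}], 2))  # -> False
```

In ImDMI's setting the checker would run over partitions produced by the insertion and split processes; counterfeit records are the usual device for groups that cannot otherwise satisfy the condition.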
Pub Date: 2025-05-28 | Epub Date: 2025-02-26 | DOI: 10.1016/j.bdr.2025.100518
Angela Maria D'Uggento, Marta Biancardi, Domenico Ciriello
In the ever-changing landscape of financial markets, accurate option pricing remains critical for investors, traders and financial institutions. Traditionally, the Black-Scholes (B&S) model has been the cornerstone for option pricing, providing a solid framework based on mathematical and physical principles. Nevertheless, the B&S model has some limitations, such as the restriction to European options, the absence of dividends, and constant volatility. Studies and academic literature on the application of machine learning models in the financial sector are rapidly increasing. The main objective of this paper is to provide a comprehensive comparative analysis between the traditional B&S model and widely used machine learning algorithms, such as Artificial Neural Networks (ANNs). The rationale is twofold. First, to examine the assumptions of the B&S model, such as constant volatility and a perfectly efficient market, in light of the complexity of the real world, while recognizing that the model has stood as a pillar of the field for decades. Secondly, to emphasize that the proliferation of big data and advances in computing power have fuelled the rise of machine learning techniques in finance. These algorithms have remarkable capabilities in discovering non-linear patterns and extracting information from large data sets, providing a compelling alternative to traditional quantitative methods. Machine learning offers a new way to capture and model such complex financial dynamics, which can lead to more accurate pricing models. By comparing the B&S model and some machine learning approaches, this paper aims to shed light on their respective strengths, weaknesses and applicability in the context of options pricing using real data.
Through rigorous empirical analyses and performance metrics, our results demonstrate that machine learning techniques can achieve higher prediction accuracy, outperforming or complementing the established B&S model in predicting option prices.
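As a concrete point of reference for the comparison above, the closed-form B&S benchmark can be sketched as follows. This is a minimal, standard-textbook implementation of the European call formula (the abstract does not provide the authors' code, so all function names here are illustrative); the bisection-based implied-volatility inversion illustrates why the constant-volatility assumption is a limitation — markets quote different implied volatilities across strikes and maturities (the volatility "smile"), which a single constant sigma cannot reproduce.

```python
from math import exp, log, sqrt
from statistics import NormalDist

N = NormalDist().cdf  # standard normal CDF


def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2)


def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0):
    """Recover the volatility implied by an observed price via bisection.

    Works because the call price is strictly increasing in sigma (positive vega).
    In real data, implied vols vary across strikes, contradicting the model's
    constant-volatility assumption.
    """
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, an at-the-money one-year call with S = K = 100, r = 5%, and sigma = 20% prices at roughly 10.45, and feeding that price back into `implied_vol` recovers sigma ≈ 0.20. An ML model fitted to market quotes sidesteps the inversion by learning the price surface directly.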
{"title":"Predicting option prices: From the Black-Scholes model to machine learning methods","authors":"Angela Maria D'Uggento, Marta Biancardi, Domenico Ciriello","doi":"10.1016/j.bdr.2025.100518","DOIUrl":"10.1016/j.bdr.2025.100518","url":null,"abstract":"<div><div>In the ever-changing landscape of financial markets, accurate option pricing remains critical for investors, traders and financial institutions. Traditionally, the Black-Scholes (B&S) model has been the cornerstone for option pricing, providing a solid framework based on mathematical and physical principles. Nevertheless, the B&S model has some limitations, such as the restriction to European options, the absence of dividends, constant volatility, etc. Studies and academic literature on the application of machine learning models in the financial sector are rapidly increasing. The main objective of this paper is to provide a comprehensive comparative analysis between the traditional B&S model and the most commonly used machine learning algorithms such as Artificial Neural Networks (ANNs). The rationale is twofold. First, to examine the assumptions of the B&S model, such as constant volatility and a perfectly efficient market, in light of the complexity of the real world, even though it is recognized that the model has been known as a pillar for decades. Secondly, to emphasize that the proliferation of big data and advances in computing power have fuelled the rise of machine learning techniques in finance. These algorithms have remarkable capabilities in discovering non-linear patterns and extracting information from large data sets, providing a compelling alternative to traditional quantitative methods. Machine learning offers a new way to capture and model such complex financial dynamics, which can lead to more accurate pricing models. 
By comparing the B&S model and some machine learning approaches, this paper aims to shed light on their respective strengths, weaknesses and applicability in the context of options pricing using real data. Through rigorous empirical analyses and performance metrics, our results demonstrate the importance of using machine learning techniques that can outperform or complement the established B&S model in predicting option prices by achieving higher prediction accuracy.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100518"},"PeriodicalIF":3.5,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143520058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}