Pub Date : 2025-06-08 DOI: 10.1016/j.bdr.2025.100552
Jiachen Ma, Nazmus Sakib, Fahim Islam Anik, Sheikh Iqbal Ahamed
While temporal sentiment labels are invaluable for video tagging, segmentation, and labeling tasks in multimedia studies, large-scale manual annotation remains prohibitive in both cost and time. Emerging online Time-Sync Comment (TSC) datasets offer a promising alternative for generating sentiment maps. However, the limited scope of existing TSC datasets and the lack of guidelines for creating data under resource constraints hinder broader use. This study addresses these challenges by proposing a novel system for automated TSC generation that uses recent YouTube comments as a readily accessible source of time-synchronized data. The efficacy of our multi-platform data mining system is evaluated through extensive long-term trials, leading to the development and analysis of two large-scale TSC datasets. Benchmarking against the original temporal Automatic Speech Recognition (ASR) sentiment annotations validates the accuracy of the generated data. This work establishes a promising method for automatic TSC generation, laying the groundwork for further advances in multimedia research and for novel sentiment analysis applications.
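As a rough illustration of what a sentiment map built from time-synchronized comments might look like, the sketch below groups timestamped comments into fixed windows and averages a per-comment sentiment score. The function name, the window length, and the scores are hypothetical; the abstract does not specify the sentiment model or the aggregation rule, so this should be read as one plausible aggregation, not the authors' pipeline.

```python
from collections import defaultdict


def sentiment_map(comments, window_s=10.0):
    """Average sentiment per time window.

    comments: iterable of (timestamp_seconds, score) pairs, where score is a
    sentiment value produced by some upstream model (assumed, not specified
    in the abstract).
    Returns {window_start_seconds: mean_score}, ordered by window start.
    """
    buckets = defaultdict(list)
    for ts, score in comments:
        # Map each comment to the start of the window that contains it.
        start = int(ts // window_s) * window_s
        buckets[start].append(score)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


# Hypothetical comments: (video timestamp in seconds, sentiment score)
example = [(1.0, 0.5), (4.0, -0.5), (12.0, 1.0)]
result = sentiment_map(example, window_s=10.0)
```

Here the first two comments fall into the window starting at 0 s and the third into the window starting at 10 s; shorter windows give a finer but noisier map.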
{"title":"Time-synchronized sentiment labeling via autonomous online comments data mining: A multimodal information fusion on large-scale multimedia data","authors":"Jiachen Ma, Nazmus Sakib, Fahim Islam Anik, Sheikh Iqbal Ahamed","doi":"10.1016/j.bdr.2025.100552","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100552"},"PeriodicalIF":3.5,"publicationDate":"2025-06-08","publicationTypes":"Journal Article"}
Pub Date : 2025-06-07 DOI: 10.1016/j.bdr.2025.100550
Samuele Cesarini, Fabrizio Antolini, Ivan Terraglia
This paper presents the development of an integrated data system tailored to the Italian regions, combining microdata from the Bank of Italy's and ISTAT's surveys. These datasets support an in-depth analysis of both domestic and international aspects of tourism, framed within the theoretical context of tourism determinants. By merging this integrated dataset with additional data from other statistical sources, this study offers a queryable relational database enabling granular regional analysis. Currently, tourism statistics in Italy are fragmented and do not provide a unified picture of tourism in its many aspects. The relational model's interoperability addresses this fragmentation, and its data definition language represents an important step towards the creation of a unified tourism archive. Microdata allow for statistical analyses different from those usually carried out with aggregated data, increasing knowledge of the dynamics of the sector.
{"title":"Development of an integrated data system for regional tourism analysis in Italy: A microdata perspective","authors":"Samuele Cesarini, Fabrizio Antolini, Ivan Terraglia","doi":"10.1016/j.bdr.2025.100550","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100550"},"PeriodicalIF":3.5,"publicationDate":"2025-06-07","publicationTypes":"Journal Article"}
Pub Date : 2025-06-06 DOI: 10.1016/j.bdr.2025.100551
Yang Liu, Xiaotang Zhou, Zhenwei Zhang, Xiran Yang
Topic models and pre-trained BERT are increasingly widely used in Natural Language Processing (NLP), but there is no standard method for combining them. In this paper, we propose a new pre-trained BERT-guided Embedding-based Topic Model (BETM). Through constraints on the topic-word and document-topic distributions, BETM learns semantic, syntactic, and topic information from BERT embeddings. In addition, we design two solutions that mitigate the insufficient contextual information caused by short inputs and the semantic truncation caused by long inputs in BETM. We find that the word embeddings of BETM are more suitable for topic modeling than pre-trained GloVe word embeddings, and that BETM can flexibly select different variants of pre-trained BERT for specific datasets to obtain better topic quality. We also find that BETM handles large and heavy-tailed vocabularies well, even when they contain stop words. BETM achieves state-of-the-art (SOTA) results on several benchmark datasets: Yelp Review Polarity (106,586 samples), Wiki Text 103 (71,533 samples), Open-Web-Text (35,713 samples), 20Newsgroups (10,899 samples), and AG-news (127,588 samples).
{"title":"BETM: A new pre-trained BERT-guided embedding-based topic model","authors":"Yang Liu, Xiaotang Zhou, Zhenwei Zhang, Xiran Yang","doi":"10.1016/j.bdr.2025.100551","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100551"},"PeriodicalIF":3.5,"publicationDate":"2025-06-06","publicationTypes":"Journal Article"}
Pub Date : 2025-05-23 DOI: 10.1016/j.bdr.2025.100537
Alessandro Magrini
The development of models for bankruptcy risk prediction has gained much attention in recent years due to the great availability of financial statement data. Most existing predictive models rely on financial ratios, which are performance-based measures expressing the relative magnitude of two accounting items. Despite the popularity of financial ratios, their use is notoriously accompanied by serious practical drawbacks, like the occurrence of outliers and redundancy, making data preprocessing necessary to avoid computational problems and obtain a good predictive accuracy. Isometric log ratios can potentially overcome these problems because they are designed to represent compositional data efficiently and have a logarithmic form that limits the occurrence of outliers. However, although they are not novel in the analysis of financial statements, no study has ever employed them to predict bankruptcy. In this article, we show the effectiveness of isometric log ratios to detect bankruptcy events in a sample of 138,720 Italian firms (127,420 active and 11,300 bankrupted) belonging to different industries and with different size and age. For this purpose, we use logistic regression with adaptive LASSO regularization and random forests to construct several predictive models featuring either financial ratios or isometric log ratios, and combining different horizons and lag structures. The results show that a set of 8 isometric log ratios provides, without preprocessing, almost the same predictive accuracy as a selection of 16 financial ratios that requires dropping 3.6% of the data. Also, the adaptive LASSO regularization reveals that redundancy for isometric log ratios is always below 20%, and in some cases near 0%, while it ranges from 12.5% to 46.9% for financial ratios. 
The predictive accuracy of models based on logistic regression is in line with and even higher than the one reported by recent studies, and random forests achieve a gain in the area under the Receiver Operating Characteristic (ROC) curve ranging between two and three percentage points.
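To make the idea of isometric log ratios concrete, the following sketch computes pivot coordinates, one standard choice of isometric log-ratio basis, for a composition with strictly positive parts. The abstract does not say which basis or which accounting items the paper uses, so the function name and the example figures are purely illustrative assumptions.

```python
import math


def ilr_pivot(parts):
    """Map a composition with D strictly positive parts to D-1 pivot coordinates.

    Coordinate i contrasts part i with the geometric mean of the parts after it:
        z_i = sqrt(k / (k + 1)) * ln(x_i / gmean(x_{i+1}, ..., x_D)),
    where k is the number of remaining parts.
    """
    if len(parts) < 2 or any(p <= 0 for p in parts):
        raise ValueError("need at least two strictly positive parts")
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        mean_log_rest = sum(rest) / k  # log of the geometric mean of the rest
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - mean_log_rest))
    return coords


# Hypothetical three-item breakdown of a balance sheet total (illustrative only)
firm = [40.0, 35.0, 25.0]
coords = ilr_pivot(firm)
rescaled = ilr_pivot([10 * v for v in firm])  # same shares, different total
```

Because only ratios between parts enter the formula, rescaling all items by the same factor leaves the coordinates unchanged, and the logarithm compresses extreme ratios, which is the property the abstract credits with limiting outliers.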
{"title":"Bankruptcy risk prediction: A new approach based on compositional analysis of financial statements","authors":"Alessandro Magrini","doi":"10.1016/j.bdr.2025.100537","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100537"},"PeriodicalIF":3.5,"publicationDate":"2025-05-23","publicationTypes":"Journal Article"}
Pub Date : 2025-05-23 DOI: 10.1016/j.bdr.2025.100543
Chiara Ardito, Roberto Leombruni, Giuseppe Costa, Angelo d’Errico
The relationship between age at retirement and subsequent physical health still appears contradictory in the literature, with more recent studies suggesting possible adverse health effects linked to employment at later ages. The aim of this study was to assess the long-term risk of overall mortality and the incidence of cardiovascular diseases (CVDs) associated with age at retirement in three large Italian cohorts, using both survey and administrative data.
The risk of mortality and CVDs associated with age at retirement, treated as a continuous variable, was assessed separately by gender using age-adjusted Cox models, further controlled for chronic morbidity, education, socioeconomic and previous working characteristics. In a second set of analyses, age at retirement was treated as a dichotomous variable: for each cut-off age from 52 to 65 years, the incidence of the health outcomes among subjects who retired after that age was compared with the incidence among those who retired up to that age.
Higher age at retirement was associated with significantly higher mortality among men in all three cohorts, while among women the association was not significant, although it was in the same direction as for men. The risk of CVDs was also significantly associated with higher age at retirement in all the datasets among men, and in two of them among women. The analyses with dichotomized age at retirement confirmed the results based on continuous age at retirement for both genders. Several robustness analyses, including an instrumental-variable (IV) Poisson model, confirmed the validity of the results for men, whereas the results for women were less stable and robust.
Policy makers should be aware of the public health risks of policies that increase retirement age.
{"title":"Mortality and risk of cardiovascular diseases by age at retirement in three Italian cohorts","authors":"Chiara Ardito, Roberto Leombruni, Giuseppe Costa, Angelo d’Errico","doi":"10.1016/j.bdr.2025.100543","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100543"},"PeriodicalIF":3.5,"publicationDate":"2025-05-23","publicationTypes":"Journal Article"}
Pub Date : 2025-05-19 DOI: 10.1016/j.bdr.2025.100540
Xiaoyu Zhang, Ye Pan, Lilan Tu
This research uses daily closing exchange rate data from 2010 to 2023 for countries participating in the Belt and Road Initiative (BRI), together with China’s import and export volumes with these countries. Taking the renminbi (RMB) as the base currency and the other BRI currencies as quote currencies, we employ the Autoregressive Distributed Lag (ARDL) model to propose an algorithm for constructing a temporal two-layer network, resulting in an exchange-rate-trade network composed of 14 subnetworks. Through an analysis of the network’s topological structure, we observe that 2013 marks a significant turning point, after which the network transitions from a decentralized to a more centralized form. To assess the annual impact of China’s exchange rate and trade from 2010 to 2023, we introduce a comprehensive index for identifying key nodes within the network. Our findings based on this index indicate that: (1) Lebanon, Kyrgyzstan, and other diverse countries and regions emerge as key nodes, demonstrating China’s close economic ties with these countries and reflecting the substantial influence of RMB internationalization; and (2) compared with other years, China’s exchange rate market exerts a notably stronger influence on the trade market in 2018, 2021, 2022, and 2023.
{"title":"The influence of China's exchange rate market on the Belt and Road trade market: Based on temporal two-layer networks","authors":"Xiaoyu Zhang, Ye Pan, Lilan Tu","doi":"10.1016/j.bdr.2025.100540","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100540"},"PeriodicalIF":3.5,"publicationDate":"2025-05-19","publicationTypes":"Journal Article"}
In this paper, we develop a multigroup hidden Markov model to tackle the issue of measurement error in multi-source data from different countries. We focus, in particular, on the measurement of employment mobility in the Netherlands and Italy using linked data from the Labour Force Survey and administrative sources. The measurement-error correction we apply reconciles differences between data sources and shows that cross-country differences in employment mobility are smaller than originally thought. Error-corrected estimates indicate that mobility from temporary to permanent employment has become, over time, larger in Italy than in the Netherlands, while mobility from non-employment to temporary employment has steadily been higher in the Netherlands than in Italy.
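The measurement-error idea can be sketched with a small hidden Markov model in which the true employment state is latent and two data sources (say, a survey and a register) each report it with their own misclassification probabilities. The code below evaluates the likelihood of a pair of observed sequences with the forward algorithm. All names and parameter values are hypothetical; a multigroup version would simply hold one such parameter set per country, and the abstract does not describe the paper's actual specification or estimation procedure.

```python
def forward_likelihood(init, trans, emis_a, emis_b, obs_a, obs_b):
    """Likelihood of two error-prone observed sequences under one latent chain.

    init[s]       : probability that the latent state at time 0 is s
    trans[r][s]   : probability of moving from latent state r to s
    emis_a[s][y]  : probability that source A reports y when the truth is s
    emis_b[s][y]  : same for source B (sources assumed conditionally
                    independent given the latent state)
    obs_a, obs_b  : observed state sequences of equal length
    """
    n = len(init)
    alpha = [init[s] * emis_a[s][obs_a[0]] * emis_b[s][obs_b[0]] for s in range(n)]
    for ya, yb in zip(obs_a[1:], obs_b[1:]):
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emis_a[s][ya] * emis_b[s][yb]
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state example: 0 = permanent, 1 = temporary employment
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
exact = [[1.0, 0.0], [0.0, 1.0]]   # a source without measurement error
noisy = [[0.95, 0.05], [0.10, 0.90]]  # a source that sometimes misclassifies

lik_exact = forward_likelihood(init, trans, exact, exact, [0, 0, 1], [0, 0, 1])
lik_noisy = forward_likelihood(init, trans, noisy, exact, [0, 1, 1], [0, 0, 1])
```

With error-free sources the likelihood reduces to the plain Markov chain probability of the observed path; with a noisy source, disagreements between the two sequences are attributed to misclassification instead of to spurious transitions, which is how such a model can reconcile sources.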
{"title":"A multiple-group hidden Markov model for multi-source data. Cross-country differences in employment mobility in the presence of measurement error","authors":"Roberta Varriale, Mauricio Garnier-Villarreal, Dimitris Pavlopoulos, Danila Filipponi","doi":"10.1016/j.bdr.2025.100527","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100527"},"PeriodicalIF":3.5,"publicationDate":"2025-05-19","publicationTypes":"Journal Article"}
Pub Date : 2025-05-17 DOI: 10.1016/j.bdr.2025.100535
Yunting Liu, Yirong Huang
Unimodal sentiment analysis often fails to capture the complexity of financial sentiment. This paper proposes a multimodal deep learning framework that integrates text, audio, and image data from CCTV news videos on TikTok to construct a multimodal sentiment indicator for the Chinese stock market. Empirical results show that multimodal fusion enhances sentiment analysis, with text outperforming audio and image modalities. The indicator correlates weakly with stock returns but significantly with market volatility, aligns with seasonal sentiment patterns, and reflects significant events like COVID-19. Additionally, weekly sentiment trends indicate the lowest sentiment on Thursdays and the highest on Fridays. This study advances financial sentiment analysis by demonstrating the efficacy of multimodal indicators in capturing market sentiment and informing volatility forecasts.
{"title":"A multimodal deep learning framework for constructing a market sentiment index from stock news","authors":"Yunting Liu, Yirong Huang","doi":"10.1016/j.bdr.2025.100535","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100535"},"PeriodicalIF":3.5,"publicationDate":"2025-05-17","publicationTypes":"Journal Article"}
Tourism sustainability is a complex and multidimensional construct, for which there is no shared definition in the literature. Consequently, there is no standard method for its measurement, and the adoption of sustainable practices often falls short of the intended goals. Contributing to the definition of the concept of sustainable tourism is therefore essential for both policymakers and academics. In this vein, news media data can be a key element for understanding the debate about tourism sustainability. This research aims to exploit the potential of news texts to explore how sustainable tourism is conceived within specific cultural contexts. Focusing on the case study of Italy, we analysed how the concept of tourism sustainability is represented in Italian newspapers, extracting the topics discussed in relation to this theme. From a methodological point of view, we employed a network-based approach for topic extraction. Our study contributes to the literature on tourism sustainability by proposing an innovative method for extracting information from unstructured data sources, such as textual data, providing policymakers with insights about the narrative around this topic.
{"title":"The narrative on tourism sustainability in Italian news: A text mining approach","authors":"Carla Galluccio, Paola Beccherle, Alessandra Petrucci","doi":"10.1016/j.bdr.2025.100541","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100541"},"PeriodicalIF":3.5,"publicationDate":"2025-05-16","publicationTypes":"Journal Article"}
Pub Date : 2025-05-16 DOI: 10.1016/j.bdr.2025.100533
Alessio Bumbea, Andrea Mazzitelli, Giuseppe Espa, Alessandro Rinaldi
Innovative startups are a source of innovation and technological development; therefore, understanding their behavior can help better recognize the business organization's direction. This paper introduces a new method for clustering innovative startups using bipartite graph partitioning combined with spatial bootstrapping, improving the accuracy and interpretability of the clusters. Recent advancements in clustering techniques have introduced ensemble or consensus clustering methods, which aim to merge multiple clustering results into a superior outcome. A key challenge in this field is effectively integrating diverse clusters, and one promising solution involves utilizing graph formalism and partitioning strategies. By leveraging advanced graph partitioning techniques, we transform the task of partitioning the ensemble graph into a community detection problem. Our methodological approach improves the traditional method of bipartite graphs used in cluster ensembles by implementing the state-of-the-art biLouvain algorithm. We also focus on techniques that increase the interpretability of the clusters and on how the clusters can be used to obtain insight from the data. The proposed methodology was applied to a dataset of technologically advanced new businesses, located in the Lombardy region and recorded as innovative startups in the special section of the Italian Chambers of Commerce's Business Register.
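A common way to cast a cluster ensemble as a bipartite graph is to create one node per item and one node per cluster of each base clustering, with an edge whenever an item belongs to a cluster. The sketch below builds only that edge list. The function name and the toy labels are hypothetical, and the community detection step (biLouvain in the paper) is not shown, since the abstract gives no implementation details.

```python
def ensemble_bipartite_edges(clusterings):
    """Build the item-cluster bipartite edge list for a cluster ensemble.

    clusterings: list of label lists, one per base clustering, all of the
    same length (labels[i] is the cluster of item i in that clustering).
    Returns a list of (item_node, cluster_node) edges.
    """
    if not clusterings:
        return []
    n_items = len(clusterings[0])
    edges = []
    for run, labels in enumerate(clusterings):
        if len(labels) != n_items:
            raise ValueError("all clusterings must label the same items")
        for item, label in enumerate(labels):
            # Cluster nodes are keyed by (run, label) so that label 0 in one
            # base clustering is distinct from label 0 in another.
            edges.append((("item", item), ("cluster", run, label)))
    return edges


# Two hypothetical base clusterings of four startups
base = [[0, 0, 1, 1], [0, 1, 1, 1]]
edges = ensemble_bipartite_edges(base)
```

Each item ends up with one edge per base clustering, so items that are repeatedly grouped together share many cluster neighbours; a bipartite community detection algorithm can then partition items and clusters jointly to obtain the consensus.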
{"title":"Bipartite graph partitioning and spatial bootstrapping methods: A case study of innovative startups","authors":"Alessio Bumbea, Andrea Mazzitelli, Giuseppe Espa, Alessandro Rinaldi","doi":"10.1016/j.bdr.2025.100533","journal":{"name":"Big Data Research","volume":"41","pages":"Article 100533"},"PeriodicalIF":3.5,"publicationDate":"2025-05-16","publicationTypes":"Journal Article"}