A decision tree algorithm based on adaptive entropy of feature value importance
Pub Date: 2025-04-14 | DOI: 10.1016/j.bdr.2025.100530
Shaobo Deng, Weili Yuan, Sujie Guan, Xing Lin, Zemin Liao, Min Li
Constructing an optimal decision tree remains a challenging task. Existing algorithms often utilize power coefficient methods or standardization techniques to weight the entropy value; however, these approaches do not sufficiently account for the importance of attributes. This paper introduces an Adaptive Entropy Decision Tree (EWDT) algorithm, which leverages eigenvalue importance and integrates singular value decomposition into the calculation of entropy values. Experimental results demonstrate that the proposed algorithm outperforms other decision tree algorithms in terms of accuracy, precision, recall, and F1-score.
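The abstract says singular value decomposition enters the entropy calculation but gives no formula. Below is a minimal sketch of one plausible reading, in which each attribute's information gain is scaled by an importance weight derived from the SVD of the centred data matrix; the weighting scheme and all function names are assumptions, not the paper's definitions.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def svd_attribute_weights(X):
    """Assumed importance scheme: each attribute's energy across the right
    singular vectors, weighted by the normalised singular values."""
    _, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    w = (Vt ** 2).T @ (s / s.sum())   # shape: (n_attributes,)
    return w / w.sum()

def weighted_info_gain(X, y, attr, w):
    """Information gain of a discrete attribute, scaled by its SVD weight."""
    gain = entropy(y)
    for v in np.unique(X[:, attr]):
        mask = X[:, attr] == v
        gain -= mask.mean() * entropy(y[mask])
    return w[attr] * gain
```

A splitter would then pick the attribute maximising this weighted gain instead of the plain ID3/C4.5 criterion.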
{"title":"A decision tree algorithm based on adaptive entropy of feature value importance","authors":"Shaobo Deng, Weili Yuan, Sujie Guan, Xing Lin, Zemin Liao, Min Li","doi":"10.1016/j.bdr.2025.100530","DOIUrl":"10.1016/j.bdr.2025.100530","url":null,"abstract":"<div><div>Constructing an optimal decision tree remains a challenging task. Existing algorithms often utilize power coefficient methods or standardization techniques to weight the entropy value; however, these approaches do not sufficiently account for the importance of attributes. This paper introduces an Adaptive Entropy Decision Tree (EWDT) algorithm, which leverages eigenvalue importance and integrates singular value decomposition into the calculation of entropy values. Experimental results demonstrate that the proposed algorithm outperforms other decision tree algorithms in terms of accuracy, precision, recall, and F1-score.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100530"},"PeriodicalIF":3.5,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143899918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TE-PADN: A poisoning attack defense model based on temporal margin samples
Pub Date: 2025-04-09 | DOI: 10.1016/j.bdr.2025.100528
Haitao He, Ke Liu, Lei Zhang, Ke Xu, Jiazheng Li, Jiadong Ren
With the development of network security research, intrusion detection systems based on deep learning show great potential in network attack detection. As crucial tools for ensuring network information security, these systems are themselves vulnerable to poisoning attacks. Currently, most poisoning attack defense methods cannot effectively utilize network traffic characteristics and are only effective for specific models, showing poor defense results for other models. Furthermore, detection of poisoning attacks is often overlooked, leading to a lack of timely and effective defense against such attacks. Therefore, we propose a data poisoning defense mechanism called TE-PADN. Firstly, we introduce a temporal margin sample generation algorithm that integrates an attention mechanism. After mapping the original data time series into a latent feature space, this algorithm learns the temporal characteristics of the data and attends to information from different positions to generate temporal margin samples for repairing poisoned models. Secondly, we propose a multi-level poisoning attack detection method for real-time and accurate detection of previously undetected poisoning attacks. By employing ensemble learning methods, this approach enhances model robustness, repairs classification boundaries that have shifted due to poisoning attacks, and achieves an efficient defense against such attacks. Finally, experimental validation of our proposed method demonstrates promising results. Under a 10% attack intensity, the average accuracy of TE-PADN in recovering poisoned models increased by 6.5% on the NSL-KDD dataset, 5.3% on the UNSW-NB15 dataset, and 5.9% on the CICIDS2017 dataset.
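The abstract does not spell out the multi-level detector, but its ensemble-learning ingredient can be illustrated. The sketch below flags training rows whose labels most members of an ensemble, each fit on a random half of the data, refuse to reproduce. This is a generic poisoning screen under stated assumptions, not TE-PADN's actual algorithm; all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flag_suspect_samples(X, y, n_members=5, threshold=0.5, seed=0):
    """Ensemble-disagreement poisoning screen (simplified stand-in for a
    multi-level detector). X, y are numpy arrays; returns suspect indices."""
    rng = np.random.default_rng(seed)
    disagreement = np.zeros(len(y))
    for m in range(n_members):
        # Each member sees a random half of the (possibly poisoned) data.
        idx = rng.choice(len(y), size=len(y) // 2, replace=False)
        clf = RandomForestClassifier(n_estimators=50, random_state=m)
        clf.fit(X[idx], y[idx])
        disagreement += (clf.predict(X) != y).astype(float)
    disagreement /= n_members
    # Rows most members mislabel relative to their recorded label are suspect.
    return np.where(disagreement > threshold)[0]
```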
{"title":"TE-PADN: A poisoning attack defense model based on temporal margin samples","authors":"Haitao He , Ke Liu , Lei Zhang , Ke Xu , Jiazheng Li , Jiadong Ren","doi":"10.1016/j.bdr.2025.100528","DOIUrl":"10.1016/j.bdr.2025.100528","url":null,"abstract":"<div><div>With the development of network security research, intrusion detection systems based on deep learning show great potential in network attack detection. As crucial tools for ensuring network information security, these systems themselves are vulnerable to poisoning attacks from attackers. Currently, most poisoning attack defense methods cannot effectively utilize network traffic characteristics and are only effective for specific models, showing poor defense results for other models. Furthermore, detection of poisoning attacks is often overlooked, leading to a lack of timely and effective defense against such attacks. Therefore, we propose a data poisoning defense mechanism called TE-PADN. Firstly, we introduce a temporal margin sample generation algorithm that integrates an attention mechanism. Based on mapping the original data time series into a latent feature space, this algorithm learns the temporal characteristics of the data and focuses on information from different positions using the attention mechanism to generate temporal margin samples for repairing poisoned models. Secondly, we propose a multi-level poisoning attack detection method for real-time and accurate detection of undetected poisoning attacks. By employing ensemble learning methods, this approach enhances model robustness, repairs model classification boundaries that have shifted due to poisoning attacks and achieves efficient defense against poisoning attacks. Finally, experimental validation of our proposed method demonstrates promising results. Under a 10% attack intensity, the average accuracy of TE-PADN in recovering poisoning models increased by 6.5% on the NSL-KDD dataset, 5.3% on the UNSW-NB15 dataset, and 5.9% on the CICIDS2017 dataset.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100528"},"PeriodicalIF":3.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143816452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging artificial intelligence for pandemic management: Case of COVID-19 in the United States
Pub Date: 2025-04-08 | DOI: 10.1016/j.bdr.2025.100529
Ehsan Ahmadi, Reza Maihami
The COVID-19 pandemic revealed significant limitations in traditional approaches to time-series analysis that rely on one-dimensional inputs, such as historical infection rates. Such approaches do not capture the complex, multifactor influences on disease spread. This paper addresses these challenges by proposing a comprehensive methodology that integrates multiple data sources, including community mobility, census information, Google search trends, socioeconomic variables, vaccination coverage, and political data. In addition, this paper proposes a new cross-learning (CL) methodology that trains machine learning models on multiple related time series simultaneously, enabling more accurate and robust predictions. Applying the CL approach with four machine learning algorithms, we successfully forecasted confirmed COVID-19 cases 30 days in advance with greater accuracy than the traditional ARIMAX model and the newer Transformer deep learning technique. Our findings identified daily hospital admissions as a significant predictor at the state level and vaccination status at the national level. Random Forest with CL was particularly effective, performing best in 44 states, while ARIMAX performed best in seven larger states. These findings highlight the importance of advanced predictive modeling in resource optimization and response strategy development for future health emergencies.
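Cross-learning here means fitting one model jointly on many related series rather than one model per state. A minimal sketch under stated assumptions: daily case counts per state, a 30-day-ahead target as described, and an illustrative feature layout of 14 lag values plus a series id.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_pooled_training(series_by_state, horizon=30, n_lags=14):
    """Cross-learning data prep: stack lagged windows from every state's
    case series into one table, tagging each row with a series id, so a
    single model is trained on all related series at once."""
    rows, targets = [], []
    for sid, cases in enumerate(series_by_state.values()):
        cases = np.asarray(cases, dtype=float)
        for t in range(n_lags, len(cases) - horizon):
            rows.append(np.r_[cases[t - n_lags:t], sid])  # lags + series id
            targets.append(cases[t + horizon])             # 30 days ahead
    return np.array(rows), np.array(targets)

# Hypothetical usage: one pooled model for all states.
# X, y = build_pooled_training({"TX": tx_cases, "OH": oh_cases})
# model = RandomForestRegressor(n_estimators=300).fit(X, y)
```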
{"title":"Leveraging artificial intelligence for pandemic management: Case of COVID-19 in the United States","authors":"Ehsan Ahmadi, Reza Maihami","doi":"10.1016/j.bdr.2025.100529","DOIUrl":"10.1016/j.bdr.2025.100529","url":null,"abstract":"<div><div>The COVID-19 pandemic revealed significant limitations in traditional approaches to analyzing time-series data that use one-dimensional data such as historical infection rates. Such approaches do not capture the complex, multifactor influences on disease spread. This paper addresses these challenges by proposing a comprehensive methodology that integrates multiple data sources, including community mobility, census information, Google search trends, socioeconomic variables, vaccination coverage, and political data. In addition, this paper proposes a new cross-learning (CL) methodology that allows for the training of machine learning models on multiple related time series simultaneously, enabling more accurate and robust predictions. Applying the CL approach with four machine learning algorithms, we successfully forecasted confirmed COVID-19 cases 30 days in advance with greater accuracy than the traditional ARIMAX model and the newer Transformer deep learning technique. Our findings identified daily hospital admissions as a significant predictor at the state level and vaccination status at the national level. Random Forest with CL was very effective, performing best in 44 states, while ARIMAX outperformed in seven larger states. These findings highlight the importance of advanced predictive modeling in resource optimization and response strategy development for future health emergencies.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100529"},"PeriodicalIF":3.5,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Settlement patterns, official statistics and geo-economic dynamics: Evidence from a LADISC approach to Italy
Pub Date: 2025-03-30 | DOI: 10.1016/j.bdr.2025.100525
Gianluigi Salvucci, Luca Salvati, Leonardo Salvatore Alaimo, Ioannis Vardopoulos
Taken as pivotal in explaining settlement patterns, territorial and socioeconomic factors, such as elevation or proximity to water bodies or infrastructures, are evolving amid contemporary trends favouring urbanized areas. Urban centers, transformed over the past decades, attract younger populations because of the inherent proximity to services and infrastructure, amid challenges posed by urban living costs and housing availability. This study extends the Latitude, Altitude, Distance from the Sea, and Proximity to Major Cities (LADISC) model, integrating two additional geographic metrics to provide a refined framework for analyzing population distribution trends. Unlike traditional approaches that rely on administrative boundaries, this model applies geostatistical techniques to high-resolution census data, offering a detailed and dynamic perspective on settlement evolution in Italy. Advanced applications of official data mining with exploratory statistical techniques allow us to uncover a significant concentration of elderly populations within urban centers, underscoring the need for tailored healthcare services and urban amenities. Conversely, we found that younger populations are decentralizing towards suburban areas, reflecting a sudden shift in preferences and mobility patterns. Such trends prompt a reassessment of urban planning and (sustainable) development strategies to accommodate diverse population needs. Our study further explores the impact of the Covid-19 pandemic on population distribution, suggesting a potential surge in remote working and digital interactions that are most likely to reshape peri-urban settlements. By refining the LADISC framework, this study presents an innovative methodology for handling large-scale census data, allowing for spatially explicit demographic analysis that captures population shifts more precisely than traditional methods.
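As a concrete, hedged illustration of the kind of covariates a LADISC-style framework assembles per census unit, the sketch below gathers the classical metrics and computes nearest-major-city distance on projected coordinates. Every column name and the Euclidean-distance choice are assumptions, not the authors' implementation, and the paper's two additional metrics are not reproduced here.

```python
import numpy as np
import pandas as pd

def ladisc_covariates(df, cities_xy):
    """Assemble LADISC-style covariates per census unit: latitude, altitude,
    distance from the sea, plus proximity to the nearest major city.
    df: DataFrame with hypothetical columns x, y (projected metres),
    latitude, altitude, dist_sea. cities_xy: (k, 2) array of city coords."""
    xy = df[["x", "y"]].to_numpy(dtype=float)
    # Distance from each unit to its nearest listed major city.
    d_city = np.linalg.norm(xy[:, None, :] - cities_xy[None, :, :], axis=2).min(axis=1)
    out = df[["latitude", "altitude", "dist_sea"]].copy()
    out["dist_major_city"] = d_city
    return out
```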
{"title":"Settlement patterns, official statistics and geo-economic dynamics: Evidence from a LADISC approach to Italy","authors":"Gianluigi Salvucci , Luca Salvati , Leonardo Salvatore Alaimo , Ioannis Vardopoulos","doi":"10.1016/j.bdr.2025.100525","DOIUrl":"10.1016/j.bdr.2025.100525","url":null,"abstract":"<div><div>Taken as pivotal in explaining settlement patterns, territorial and socioeconomic factors — such as elevation or proximity to water bodies or infrastructures — are evolving amid contemporary trends favouring urbanized areas. Urban centers, transformed over the past decades, attract younger populations because of the inherent proximity to services and infrastructure, amid challenges posed by urban living costs and housing availability. This study extends the Latitude, Altitude, Distance from the Sea, and Proximity to Major Cities (LADISC) model, integrating two additional geographic metrics to provide a refined framework for analyzing population distribution trends. Unlike traditional approaches that rely on administrative boundaries, this model applies geostatistical techniques to high-resolution census data, offering a detailed and dynamic perspective on settlement evolution in Italy. Advanced applications of official data mining with exploratory statistical techniques allow for the uncovering of a significant concentration of elderly populations within urban centers, underscoring the needed for tailored healthcare services and urban amenities. Conversely, we found that younger populations are decentralizing towards suburban areas, reflecting a sudden shift in preferences and mobility patterns. Such trends prompt a reassessment of urban planning and (sustainable) development strategies to accommodate diverse population needs. Our study further explores the impact of Covid-19 pandemic on population distribution, suggesting a potential surge in remote working and digital interactions that are most likely to reshape peri‑urban settlements. By refining the LADISC framework, this study presents an innovative methodology for handling large-scale census data, allowing for spatially explicit demographic analysis that captures population shifts more precisely than traditional methods.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100525"},"PeriodicalIF":3.5,"publicationDate":"2025-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Women in life sciences firms: Gender diversity and roles indicator from data integration
Pub Date: 2025-03-28 | DOI: 10.1016/j.bdr.2025.100526
Laura Benedan, Cinzia Colapinto, Paolo Mariani, Laura Pagani, Mariangela Zenga
The present study examines the state of gender equality and inclusion in Italian life sciences companies. An ad hoc questionnaire was developed and distributed to human resources professionals from various firms with the objective of gathering insights on gender equality practices. Our primary data were combined with available information from the AIDA database, including company size in terms of the number of employees and sales revenues. To assess the degree of companies' commitment to sustainability and gender equality, we analysed their websites. Three statistical indicators were constructed and combined into a practical synthetic index. This index may be used in future research to quantify and measure each company's overall propensity towards gender equality and inclusion.
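The aggregation behind the synthetic index is not detailed in the abstract. Below is a minimal sketch of one standard construction, min-max normalising the three indicators and taking a weighted mean; the column names and equal default weights are assumptions, not the authors' method.

```python
import numpy as np
import pandas as pd

def synthetic_index(df, cols, weights=None):
    """Composite index: min-max normalise each indicator to [0, 1], then
    average with the given weights (equal by default). Assumes the
    indicator columns are non-constant."""
    weights = np.full(len(cols), 1.0 / len(cols)) if weights is None else np.asarray(weights)
    z = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
    return z.to_numpy() @ weights

# Hypothetical indicator names for the three constructed indicators:
# firms["gei"] = synthetic_index(firms, ["ind_policies", "ind_roles", "ind_web"])
```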
{"title":"Women in life sciences firms: Gender diversity and roles indicator from data integration","authors":"Laura Benedan , Cinzia Colapinto , Paolo Mariani , Laura Pagani , Mariangela Zenga","doi":"10.1016/j.bdr.2025.100526","DOIUrl":"10.1016/j.bdr.2025.100526","url":null,"abstract":"<div><div>The present study examines the state of gender equality and inclusion in Italian life sciences companies. An ad hoc questionnaire was developed and distributed to human resources professionals from various firms with the objective of gathering insights on gender equality practices. Our primary data have been combined with available information from the AIDA database. This included information on the size of the companies in terms of the number of employees and sales revenues. To assess the degree of ' commitment to sustainability and gender equality, we analysed their websites. Three statistical indicators were constructed and combined into a practical synthetic index. This index may be used in future research to quantify and measure each company's overall propensity towards gender equality and inclusion.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100526"},"PeriodicalIF":3.5,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient, interpretable and automated feature engineering for bank data
Pub Date: 2025-03-28 | DOI: 10.1016/j.bdr.2025.100524
Atilla Karaahmetoğlu, Mehmet Yıldız, Erdem Ünal, Uğur Aydın, Murat Koraş, Barış Akgün
Banks rely on expert-generated features and simple models to achieve high performance and interpretability at the same time. Interpretability is needed for internal assessment and regulatory compliance in specific problems such as risk assessment, and both expert-generated features and simple models satisfy this need. However, feature generation by experts is a time-consuming process and susceptible to bias. In addition, features need to be regenerated fairly often due to the dynamic nature of bank data, and in case of significant changes or new data sources, expertise might take a while to build up. Complex models, such as deep neural networks, may be able to remedy this. However, interpretability/explainability approaches for complex models are not satisfactory from the banks' point of view. In addition, such models do not always work well with the tabular data that is abundant in banking applications. This paper introduces an automated feature synthesis pipeline that creates informative and domain-interpretable features while consuming significantly less time than brute-force methods. We create novel feature synthesis steps, define elimination rules to rule out uninterpretable features, and combine performance-based feature selection methods to pick desirable features for our models. Our results on two different datasets show that the features generated with our pipeline (1) perform on par with or better than features generated by existing methods, (2) are obtained faster, and (3) are domain-interpretable.
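A hedged sketch of the generate-eliminate-select loop the abstract describes, using pairwise ratios as candidate features, a division-safety check as a stand-in elimination rule, and cross-validated AUC gain as the performance criterion. The paper's actual synthesis steps and rules are richer than this; all names here are illustrative.

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synthesise(df, numeric_cols, max_feats=20):
    """Generate interpretable ratio candidates, applying a simple
    elimination rule (skip near-zero denominators)."""
    candidates = {}
    for a, b in itertools.permutations(numeric_cols, 2):
        if (df[b].abs() < 1e-9).any():
            continue  # elimination rule: avoid unstable, uninterpretable ratios
        candidates[f"{a}_per_{b}"] = df[a] / df[b]
    return pd.DataFrame(candidates).iloc[:, :max_feats]

def select_by_performance(cand, y, base, k=5):
    """Keep candidates that improve a simple model's cross-validated AUC
    over the existing feature matrix `base` (ndarray)."""
    model = LogisticRegression(max_iter=500)
    score0 = cross_val_score(model, base, y, scoring="roc_auc", cv=k).mean()
    keep = []
    for col in cand.columns:
        trial = np.column_stack([base, cand[col]])
        if cross_val_score(model, trial, y, scoring="roc_auc", cv=k).mean() > score0:
            keep.append(col)
    return keep
```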
{"title":"Efficient, interpretable and automated feature engineering for bank data","authors":"Atilla Karaahmetoğlu , Mehmet Yıldız , Erdem Ünal , Uğur Aydın , Murat Koraş , Barış Akgün","doi":"10.1016/j.bdr.2025.100524","DOIUrl":"10.1016/j.bdr.2025.100524","url":null,"abstract":"<div><div>Banks rely on expert-generated features and simple models to have high performance and interpretability at the same time. Interpretability is needed for internal assessment and regulatory compliance for specific problems such as risk assessment and both expert generated features and simple models satisfy this need. However, feature generation by experts is a time-consuming process and susceptible to bias. In addition, features need to be generated fairly often due to the dynamic nature of bank data, and in case of significant changes or new data sources, expertise might take a while to build up. Complex models, such as deep neural networks, may be able to remedy this. However, interpretability/explainability approaches for complex models are not satisfactory from the banks' point of view. In addition, such models do not always work well with tabular data which is abundant in banking applications. This paper introduces an automated feature synthesis pipeline that creates informative and domain-interpretable features which iconsumes significantly less time than brute-force methods. We create novel feature synthesis steps, define elimination rules to rule out uninterpretable features, and combine performance-based feature selection methods to pick desirable ones to build our models. Our results on two different datasets show that the features generated with our pipeline; (1) perform on par or better than features generated by existing methods, (2) are obtained faster, and (3) are domain-interpretable.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100524"},"PeriodicalIF":3.5,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143790985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NoSQL data warehouse optimizing models: A comparative study of column-oriented approaches
Pub Date: 2025-03-20 | DOI: 10.1016/j.bdr.2025.100523
Mohamed Mouhiha, Abdelfettah Mabrouk
Building an efficient Big Data Warehouse (DW) from the traditional data warehouses once used to handle large datasets remains a great challenge. Several existing solutions concentrate on converting a standard DW to a columnar model, especially for direct and traditional data sources. Though many successful algorithms apply data clustering methods, these approaches also come with their fair share of limitations. This paper provides a comprehensive review of the existing methods, both tuned and out-of-the-box, exposing their strengths and weaknesses. Further, a comparative study of the different approaches is conducted to assess them.
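For orientation, the conversion the reviewed methods perform can be sketched as flattening a star-schema fact row and its dimension rows into one column-family record (HBase/Cassandra style). The layout below is illustrative only, not any specific reviewed algorithm.

```python
def to_column_family(fact_row, dims):
    """Map one fact row plus its joined dimension rows into a single
    column-family record: one family for measures, one per dimension."""
    record = {"measures": dict(fact_row)}
    for name, dim_row in dims.items():
        record[f"dim_{name}"] = dict(dim_row)
    return record

# Example: a sale fact denormalised with its time and store dimensions.
rec = to_column_family(
    {"amount": 129.0, "qty": 3},
    {"time": {"day": 14, "month": 4, "year": 2025},
     "store": {"city": "CityA", "region": "RegionB"}},
)
```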
{"title":"NoSQL data warehouse optimizing models: A comparative study of column-oriented approaches","authors":"Mohamed Mouhiha, Abdelfettah Mabrouk","doi":"10.1016/j.bdr.2025.100523","DOIUrl":"10.1016/j.bdr.2025.100523","url":null,"abstract":"<div><div>There is a great challenge when building an efficient Big Data Warehouse (DW) from the traditional data warehouse which used to handle the large datasets. Several presented solutions concentrate on the conversion of a standard DW to an columnar model, especially for direct and traditional data sources. Though there have been many successful algorithms that apply data clustering methods, these approaches also come with their fair share of limitations. This paper provides a comprehensive review of the existing methods, both tuned and out-of-the box, exposing their strengths and weaknesses. Further, a comparative study of the different options is always conducted to compare and assess them.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100523"},"PeriodicalIF":3.5,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143681953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-dimensional feature learning for visible-infrared person re-identification
Pub Date: 2025-03-17 | DOI: 10.1016/j.bdr.2025.100522
Zhenzhen Yang, Xinyi Wu, Yongpeng Yang
Visible-infrared person re-identification (VI-ReID) is a challenging task due to significant differences between the modalities and feature representations of visible and infrared images. The primary goal of current VI-ReID is to reduce discrepancies between modalities. However, existing research primarily focuses on learning modality-invariant features, and owing to the significant modality differences, it is challenging to learn an effective common feature space. Moreover, intra-modality differences have not been well addressed. Therefore, a novel multi-dimensional feature learning network (MFLNet) is proposed in this paper to tackle the inherent challenges of intra-modality and inter-modality differences in VI-ReID. Specifically, to effectively address intra-modality variations, we employ random local shear (RLS) augmentation, which accurately simulates viewpoint and posture changes. This augmentation can be seamlessly incorporated into other methods without modifying the network or its parameters. Additionally, we integrate a multi-dimensional information mining (MIM) module to extract discriminative features and bridge the gap between modalities. Moreover, a cyclical smoothing focal (CSF) loss is introduced to prioritize challenging samples during training, thereby enhancing ReID performance. Finally, the experimental results indicate that the proposed MFLNet outperforms other VI-ReID approaches on the SYSU-MM01, RegDB and LLCM datasets.
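The abstract does not give the RLS warp itself. Below is a minimal NumPy sketch of one way to shear a randomly chosen local window of the image row by row; the window size, shear range, and roll-based shift are all assumptions rather than the paper's exact transform.

```python
import numpy as np

def random_local_shear(img, max_shear=0.3, region=0.5, rng=None):
    """Shear a random sub-window of an H x W (x C) image horizontally to
    imitate viewpoint/posture change (illustrative version of RLS)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    rh, rw = int(h * region), int(w * region)
    y0 = rng.integers(0, h - rh + 1)
    x0 = rng.integers(0, w - rw + 1)
    s = rng.uniform(-max_shear, max_shear)   # shear factor
    patch = img[y0:y0 + rh, x0:x0 + rw].copy()
    out = img.copy()
    for r in range(rh):
        shift = int(round(s * r))            # per-row shift = horizontal shear
        out[y0 + r, x0:x0 + rw] = np.roll(patch[r], shift, axis=0)
    return out
```

Because it only rewrites pixels, a transform like this can be dropped into any training pipeline without touching the network, which matches the plug-and-play claim.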
{"title":"Multi-dimensional feature learning for visible-infrared person re-identification","authors":"Zhenzhen Yang, Xinyi Wu, Yongpeng Yang","doi":"10.1016/j.bdr.2025.100522","DOIUrl":"10.1016/j.bdr.2025.100522","url":null,"abstract":"<div><div>Visible-infrared person re-identification (VI-ReID) is a challenging task due to significant differences between modalities and feature representation of visible and infrared images. The primary goal of current VI-ReID is to reduce discrepancies between modalities. However, existing research primarily focuses on learning modality-invariant features. Due to significant modality differences, it is challenging to learn an effectively common feature space. Moreover, the intra-modality differences have not been well addressed. Therefore, a novel multi-dimensional feature learning network (MFLNet) is proposed in this paper to tackle the inherent challenges of intra-modality and inter-modality differences in VI-ReID. Specifically, to effectively address intra-modality variations, we employ the random local shear (RLS) augmentation, which accurately simulates viewpoint and posture changes. This augmentation can be seamlessly incorporated into other methods without modifying the network or parameters. Additionally, we integrate the multi-dimensional information mining (MIM) module to extract discriminative features and bridge the gap between modalities. Moreover, the cyclical smoothing focal (CSF) loss is introduced to prioritize challenging samples during training, thereby enhancing the ReID performance. Finally, the experimental results indicate that the proposed MFLNet outperforms other VI-ReID approaches on the SYSU-MM01, RegDB and LLCM datasets.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100522"},"PeriodicalIF":3.5,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143654669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep attention dynamic representation learning networks for recommender system review modeling
Pub Date: 2025-03-15 | DOI: 10.1016/j.bdr.2025.100521
Shivangi Gheewala, Shuxiang Xu, Soonja Yeom
Despite considerable research on utilizing deep learning technology and textual reviews in recommender systems, improving system performance remains a contentious matter, primarily due to issues faced in learning user-item representations. One issue is the limited ability of networks to model dynamic user-item representations from reviews. In particular, in sequence-to-sequence learning models there is a substantial likelihood of losing the semantic knowledge of previous review sequences as it is overridden by subsequent ones. Another issue lies in effectively integrating global-level and topical-level representations to extract informative content and enhance user-item representations. Existing methods struggle to maintain contextual consistency during this integration, resulting in suboptimal representation learning, especially when attempting to capture finer details. To address these issues, we propose a novel recommendation model called Deep Attention Dynamic Representation Learning (DADRL). Specifically, we employ Latent Dirichlet Allocation and a dynamic modulator-based Long Short-Term Memory network to extract topical and dynamic global representations. We then introduce an attentional fusion methodology to integrate these representations in a contextually consistent manner and construct informative attentional user-item representations. These representations are fed into a factorization machines layer to predict the final scores. Experimental results on Amazon categories, Yelp, and LibraryThing show that our model outperforms several state-of-the-art methods. We further examine the DADRL architecture under various conditions to provide insights into the model's components.
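A hedged PyTorch sketch of what attentional fusion of a topical (LDA) vector and a dynamic global (LSTM) vector could look like; the two-view softmax scoring and the shared dimension are assumptions, not DADRL's published design.

```python
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    """Fuse a topical view and a global view of a user/item by scoring each
    with a small attention head and mixing them (illustrative design)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, topical, global_):              # both: (batch, dim)
        views = torch.stack([topical, global_], dim=1)   # (batch, 2, dim)
        alpha = torch.softmax(self.score(views), dim=1)  # (batch, 2, 1)
        return (alpha * views).sum(dim=1)                # fused: (batch, dim)

# The fused user and item vectors would then feed the factorization
# machines layer that produces the final rating prediction.
```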
{"title":"Deep attention dynamic representation learning networks for recommender system review modeling","authors":"Shivangi Gheewala , Shuxiang Xu , Soonja Yeom","doi":"10.1016/j.bdr.2025.100521","DOIUrl":"10.1016/j.bdr.2025.100521","url":null,"abstract":"<div><div>Despite considerable research of utilizing deep learning technology and textual reviews in recommender systems, improving system performance is a contentious matter. This is primarily due to issues faced in learning user-item representations. One issue is the limited ability of networks to model dynamic user-item representations from reviews. Particularly, in sequence-to-sequence learning models, there appears a substantial likelihood of losing semantic knowledge of previous review sequences, as overridden by the next. Another issue lies in effectively integrating global-level and topical-level representations to extract informative content and enhance user-item representations. Existing methods struggle to maintain contextual consistency during this integration process, resulting in suboptimal representation learning, especially attempting to capture finer details. To address these issues, we propose a novel recommendation model called Deep Attention Dynamic Representation Learning (DADRL). Specifically, we employ Latent Dirichlet Allocation and dynamic modulator-based Long Short-Term Memory to extract topical and dynamic global representations. Then, we introduce an attentional fusion methodology to integrate these representations in a contextually consistent manner and construct informative attentional user-item representations. We use these representations into the factorization machines layer to predict the final scores. Experimental results on Amazon categories, Yelp, and LibraryThing show that our model exhibits superior performance compared to several state-of-the-arts. We further examine the DADRL architecture under various conditions to provide insights on the model's employed components.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100521"},"PeriodicalIF":3.5,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143681952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complex data in tourism analysis: A stochastic approach to price competition
Pub Date: 2025-03-13 | DOI: 10.1016/j.bdr.2025.100520
Giovanni Angelini, Michele Costa, Andrea Guizzardi
This study examines pricing strategies and decision-making processes in the hospitality industry by analyzing “ask” prices on online travel agencies (i.e., the rates at which hoteliers are willing to sell their rooms). We face the challenge of modeling a continuous flow of big data organized as “time series of time series,” where daily seasonality and advance bookings intersect. Our research combines insights from tourism, quantitative methods, and big data to improve pricing strategies, contributing to both theory and practice in revenue management. Focusing on Venice, we analyze price competition as a multivariate stochastic process using a Structural Vector Autoregressive (SVAR) approach, aligning with modern dynamic pricing algorithms.
The findings show that time-based pricing strategies, which adjust based on the day of arrival and booking, are more important than room features in setting hotel prices. We also find that price changes have a non-linear and decreasing effect as the booking date approaches. These insights suggest that hotels could create more advanced pricing strategies, and policymakers should consider these factors when addressing the challenges related to overtourism.
We study the complex competitive relationships among heterogeneous service providers with an approach applicable to any market where consumption is delayed relative to purchase time. However, we highlight that the quality and accessibility of information in the tourism sector are key aspects to be considered when using big data in this industry.
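As a minimal sketch of the modelling step on synthetic data: fit a VAR on a panel of daily price series and read off Cholesky-orthogonalised impulse responses, which amounts to a recursively identified SVAR. The tier names, lag order, and simulated data are illustrative, and the paper's identification scheme may differ.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Hypothetical panel: daily log ask-price changes for three hotel tiers.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=400, freq="D")
data = pd.DataFrame(rng.normal(scale=0.02, size=(400, 3)),
                    index=idx, columns=["budget", "midscale", "luxury"])

res = VAR(data).fit(7)       # illustrative fixed lag order of one week
irf = res.irf(21)            # responses over a three-week horizon
# Orthogonalised (Cholesky) responses trace how a shock to one tier's
# ask price propagates to competitors over lead time.
print(irf.orth_irfs.shape)   # (n_steps, neqs, neqs)
```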
{"title":"Complex data in tourism analysis: A stochastic approach to price competition","authors":"Giovanni Angelini , Michele Costa , Andrea Guizzardi","doi":"10.1016/j.bdr.2025.100520","DOIUrl":"10.1016/j.bdr.2025.100520","url":null,"abstract":"<div><div>This study examines pricing strategies and decision-making processes in the hospitality industry by analyzing “ask” prices on online travel agencies (i.e., the rates at which hoteliers are willing to sell their rooms). We face the challenge of modeling a continuous flow of big data organized as “time series of time series,” where daily seasonality and advance bookings intersect. Our research combines insights from tourism, quantitative methods, and big data to improve pricing strategies, contributing to both theory and practice in revenue management. Focusing on Venice, we analyze price competition as a multivariate stochastic process using a Structural Vector Autoregressive (SVAR) approach, aligning with modern dynamic pricing algorithms.</div><div>The findings show that time-based pricing strategies, which adjust based on the day of arrival and booking, are more important than room features in setting hotel prices. We also find that price changes have a non-linear and decreasing effect as the booking date approaches. These insights suggest that hotels could create more advanced pricing strategies, and policymakers should consider these factors when addressing the challenges related to overtourism.</div><div>We study the complex competitive relationships among heterogeneous service providers with an approach applicable to any market where consumption is delayed relative to purchase time. However, we highlight that the quality and accessibility of information in the tourism sector are key aspects to be considered when using big data in this industry.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100520"},"PeriodicalIF":3.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143641669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}