Big Data Research: Latest Articles

A decision tree algorithm based on adaptive entropy of feature value importance
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-04-14; DOI: 10.1016/j.bdr.2025.100530
Shaobo Deng, Weili Yuan, Sujie Guan, Xing Lin, Zemin Liao, Min Li
Constructing an optimal decision tree remains a challenging task. Existing algorithms often utilize power coefficient methods or standardization techniques to weight the entropy value; however, these approaches do not sufficiently account for the importance of attributes. This paper introduces an Adaptive Entropy Decision Tree (EWDT) algorithm, which leverages eigenvalue importance and integrates singular value decomposition into the calculation of entropy values. Experimental results demonstrate that the proposed algorithm outperforms other decision tree algorithms in terms of accuracy, precision, recall, and F1-score.
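As a rough illustration of what "weighting the entropy value" can mean, the sketch below computes Shannon entropy and an information gain scaled by a per-attribute importance weight. The `weight` argument is a hypothetical stand-in: the abstract does not give the EWDT formula, and the SVD-derived importance it mentions is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_gain(values, labels, weight=1.0):
    """Information gain of splitting on `values`, scaled by `weight`.

    `weight` is a hypothetical importance score; with weight=1.0 this
    reduces to plain ID3-style information gain.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the labels after the split.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return weight * (entropy(labels) - conditional)


# Toy data (made up for illustration).
labels = ["yes", "yes", "no", "no"]
outlook = ["sun", "sun", "rain", "rain"]  # separates the classes perfectly
windy = ["t", "f", "t", "f"]              # carries no information
```

With these toy columns, `outlook` has the full one bit of gain and `windy` has none; lowering `weight` shrinks an attribute's gain proportionally, which is the general idea behind importance-weighted splitting criteria.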
Big Data Research, Volume 40, Article 100530
Citations: 0
TE-PADN: A poisoning attack defense model based on temporal margin samples
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-04-09; DOI: 10.1016/j.bdr.2025.100528
Haitao He, Ke Liu, Lei Zhang, Ke Xu, Jiazheng Li, Jiadong Ren
With the development of network security research, intrusion detection systems based on deep learning show great potential in network attack detection. Although they are crucial tools for ensuring network information security, these systems are themselves vulnerable to poisoning attacks. Currently, most poisoning attack defense methods cannot effectively utilize network traffic characteristics and are effective only for specific models, showing poor defense results for other models. Furthermore, detection of poisoning attacks is often overlooked, leading to a lack of timely and effective defense against such attacks. Therefore, we propose a data poisoning defense mechanism called TE-PADN. First, we introduce a temporal margin sample generation algorithm that integrates an attention mechanism. By mapping the original data time series into a latent feature space, this algorithm learns the temporal characteristics of the data and uses the attention mechanism to focus on information from different positions, generating temporal margin samples for repairing poisoned models. Second, we propose a multi-level poisoning attack detection method for real-time and accurate detection of previously undetected poisoning attacks. By employing ensemble learning methods, this approach enhances model robustness, repairs model classification boundaries that have shifted due to poisoning attacks, and achieves efficient defense against poisoning attacks. Finally, experimental validation of our proposed method demonstrates promising results. Under a 10% attack intensity, the average accuracy of TE-PADN in recovering poisoned models increased by 6.5% on the NSL-KDD dataset, 5.3% on the UNSW-NB15 dataset, and 5.9% on the CICIDS2017 dataset.
Big Data Research, Volume 40, Article 100528
Citations: 0
Leveraging artificial intelligence for pandemic management: Case of COVID-19 in the United States
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-04-08; DOI: 10.1016/j.bdr.2025.100529
Ehsan Ahmadi, Reza Maihami
The COVID-19 pandemic revealed significant limitations in traditional approaches to analyzing time-series data that use one-dimensional data such as historical infection rates. Such approaches do not capture the complex, multifactor influences on disease spread. This paper addresses these challenges by proposing a comprehensive methodology that integrates multiple data sources, including community mobility, census information, Google search trends, socioeconomic variables, vaccination coverage, and political data. In addition, this paper proposes a new cross-learning (CL) methodology that allows for the training of machine learning models on multiple related time series simultaneously, enabling more accurate and robust predictions. Applying the CL approach with four machine learning algorithms, we successfully forecasted confirmed COVID-19 cases 30 days in advance with greater accuracy than the traditional ARIMAX model and the newer Transformer deep learning technique. Our findings identified daily hospital admissions as a significant predictor at the state level and vaccination status at the national level. Random Forest with CL was very effective, performing best in 44 states, while ARIMAX outperformed in seven larger states. These findings highlight the importance of advanced predictive modeling in resource optimization and response strategy development for future health emergencies.
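The core idea of training on multiple related time series at once can be sketched as pooling lag windows from every series into a single training set. This is only a minimal illustration under assumed names (`make_windows`, `pool_series`, and the toy `cases` data are hypothetical); the abstract does not specify how the paper's CL method builds its inputs.

```python
def make_windows(series, n_lags, horizon):
    """Turn one series into (lag-window, target) pairs.

    The target is the value `horizon` steps after the end of the window.
    """
    pairs = []
    for end in range(n_lags, len(series) - horizon + 1):
        window = series[end - n_lags:end]
        target = series[end + horizon - 1]
        pairs.append((window, target))
    return pairs


def pool_series(series_by_region, n_lags, horizon):
    """Stack the windows from every region into one shared training set."""
    X, y = [], []
    for series in series_by_region.values():
        for window, target in make_windows(series, n_lags, horizon):
            X.append(window)
            y.append(target)
    return X, y


# Hypothetical toy data: daily case counts for two regions.
cases = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [10, 20, 30, 40, 50, 60],
}
X, y = pool_series(cases, n_lags=3, horizon=2)
```

A single model fitted on the pooled `X` and `y` sees patterns from all regions, in contrast to fitting one model per region. In practice the series would usually be scaled first so that large regions do not dominate the fit.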
Big Data Research, Volume 40, Article 100529
Citations: 0
Settlement patterns, official statistics and geo-economic dynamics: Evidence from a LADISC approach to Italy
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-03-30; DOI: 10.1016/j.bdr.2025.100525
Gianluigi Salvucci, Luca Salvati, Leonardo Salvatore Alaimo, Ioannis Vardopoulos
Taken as pivotal in explaining settlement patterns, territorial and socioeconomic factors (such as elevation or proximity to water bodies or infrastructure) are evolving amid contemporary trends favouring urbanized areas. Urban centers, transformed over the past decades, attract younger populations because of their inherent proximity to services and infrastructure, amid challenges posed by urban living costs and housing availability. This study extends the Latitude, Altitude, Distance from the Sea, and Proximity to Major Cities (LADISC) model, integrating two additional geographic metrics to provide a refined framework for analyzing population distribution trends. Unlike traditional approaches that rely on administrative boundaries, this model applies geostatistical techniques to high-resolution census data, offering a detailed and dynamic perspective on settlement evolution in Italy. Advanced applications of official data mining with exploratory statistical techniques uncover a significant concentration of elderly populations within urban centers, underscoring the need for tailored healthcare services and urban amenities. Conversely, we found that younger populations are decentralizing towards suburban areas, reflecting a sudden shift in preferences and mobility patterns. Such trends prompt a reassessment of urban planning and (sustainable) development strategies to accommodate diverse population needs. Our study further explores the impact of the COVID-19 pandemic on population distribution, suggesting a potential surge in remote working and digital interactions that are most likely to reshape peri-urban settlements. By refining the LADISC framework, this study presents an innovative methodology for handling large-scale census data, allowing for spatially explicit demographic analysis that captures population shifts more precisely than traditional methods.
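One of the LADISC components, proximity to major cities, could be computed from coordinates with the great-circle (haversine) distance. The sketch below is an assumption about how such a metric might be derived; the abstract does not say which distance measure the authors use, and the function names and approximate city coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_city_km(lat, lon, cities):
    """Distance from a point to the closest city in `cities`."""
    return min(haversine_km(lat, lon, c_lat, c_lon)
               for c_lat, c_lon in cities.values())


# Approximate coordinates, used only as example inputs.
cities = {"Rome": (41.9028, 12.4964), "Milan": (45.4642, 9.1900)}
rome_to_milan = haversine_km(*cities["Rome"], *cities["Milan"])
```

Applied to every census cell, `nearest_city_km` would yield one proximity value per cell, which can then sit alongside latitude, altitude, and distance from the sea as an explanatory variable.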
Big Data Research, Volume 40, Article 100525
Citations: 0
Women in life sciences firms: Gender diversity and roles indicator from data integration
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-03-28; DOI: 10.1016/j.bdr.2025.100526
Laura Benedan, Cinzia Colapinto, Paolo Mariani, Laura Pagani, Mariangela Zenga
The present study examines the state of gender equality and inclusion in Italian life sciences companies. An ad hoc questionnaire was developed and distributed to human resources professionals from various firms with the objective of gathering insights on gender equality practices. Our primary data have been combined with available information from the AIDA database, including company size in terms of the number of employees and sales revenues. To assess the degree of the companies' commitment to sustainability and gender equality, we analysed their websites. Three statistical indicators were constructed and combined into a practical synthetic index. This index may be used in future research to quantify and measure each company's overall propensity towards gender equality and inclusion.
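A common way to combine several indicators into one synthetic index is to rescale each to a shared range and average them. The sketch below assumes min-max scaling and equal weights; the abstract does not describe the actual aggregation rule, and the indicator names and scores are hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant indicator cannot discriminate between companies.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def synthetic_index(indicators):
    """Equal-weight mean of min-max scaled indicators, one value per company.

    `indicators` maps an indicator name to a list of scores, with the
    companies in the same order in every list.
    """
    scaled = [min_max(column) for column in indicators.values()]
    return [sum(row) / len(row) for row in zip(*scaled)]


# Hypothetical scores for three companies.
scores = {
    "questionnaire": [2, 4, 6],
    "company_data": [10, 10, 30],
    "website": [0, 1, 1],
}
index = synthetic_index(scores)
```

Here the three companies receive index values of 0.0, 0.5, and 1.0. Unequal weights or a different normalisation would change the ranking, so those choices would need to be justified in a real study.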
Big Data Research, Volume 40, Article 100526
Citations: 0
Efficient, interpretable and automated feature engineering for bank data
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-03-28; DOI: 10.1016/j.bdr.2025.100524
Atilla Karaahmetoğlu, Mehmet Yıldız, Erdem Ünal, Uğur Aydın, Murat Koraş, Barış Akgün
Banks rely on expert-generated features and simple models to have high performance and interpretability at the same time. Interpretability is needed for internal assessment and regulatory compliance for specific problems such as risk assessment, and both expert-generated features and simple models satisfy this need. However, feature generation by experts is a time-consuming process and susceptible to bias. In addition, features need to be generated fairly often due to the dynamic nature of bank data, and in case of significant changes or new data sources, expertise might take a while to build up. Complex models, such as deep neural networks, may be able to remedy this. However, interpretability/explainability approaches for complex models are not satisfactory from the banks' point of view. In addition, such models do not always work well with tabular data, which is abundant in banking applications. This paper introduces an automated feature synthesis pipeline that creates informative and domain-interpretable features and consumes significantly less time than brute-force methods. We create novel feature synthesis steps, define elimination rules to rule out uninterpretable features, and combine performance-based feature selection methods to pick desirable ones to build our models. Our results on two different datasets show that the features generated with our pipeline (1) perform on par with or better than features generated by existing methods, (2) are obtained faster, and (3) are domain-interpretable.
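The three stages named in the abstract (synthesis, rule-based elimination, performance-based selection) can be sketched in miniature as follows. Everything specific here is an assumption made for illustration: ratio features as the synthesis step, a "same unit" rule as the interpretability filter, and absolute Pearson correlation as the selection score. The paper's actual steps are not given in the abstract.

```python
import math
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def synthesize_ratios(columns, units):
    """Build a/b features, keeping only pairs that share a unit.

    The unit check is a hypothetical elimination rule: a ratio of two
    currency amounts is easy to explain, a ratio of debt to age is not.
    """
    features = {}
    for a, b in permutations(columns, 2):
        if units[a] != units[b] or any(v == 0 for v in columns[b]):
            continue
        features[f"{a}/{b}"] = [x / y for x, y in zip(columns[a], columns[b])]
    return features


def select_top(features, target, k):
    """Rank features by absolute correlation with the target; keep k."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]


# Made-up toy table with four customers.
columns = {"debt": [10, 40, 30, 80],
           "income": [100, 100, 50, 100],
           "age": [30, 40, 50, 60]}
units = {"debt": "currency", "income": "currency", "age": "years"}
defaulted = [0, 0, 1, 1]

features = synthesize_ratios(columns, units)
best = select_top(features, defaulted, k=1)
```

On this toy table only `debt/income` and `income/debt` survive the unit rule, and `debt/income` is selected. Filtering before scoring is what keeps the search cheaper than brute force, since uninterpretable candidates are never evaluated.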
Big Data Research, Volume 40, Article 100524
Citations: 0
NoSQL data warehouse optimizing models: A comparative study of column-oriented approaches
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-03-20; DOI: 10.1016/j.bdr.2025.100523
Mohamed Mouhiha, Abdelfettah Mabrouk
Building an efficient Big Data Warehouse (DW) from a traditional data warehouse that was used to handle large datasets is a major challenge. Several proposed solutions concentrate on converting a standard DW to a columnar model, especially for direct and traditional data sources. Although many successful algorithms apply data clustering methods, these approaches also come with their fair share of limitations. This paper provides a comprehensive review of the existing methods, both tuned and out-of-the-box, exposing their strengths and weaknesses. Further, a comparative study of the different options is conducted to assess them.
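The conversion from a row-oriented layout to a columnar one can be pictured with plain Python structures. This is a conceptual sketch only: the table and the `to_columns` helper are hypothetical, and real column-oriented NoSQL stores add column families, compression, and distribution on top of this basic idea.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "north", "amount": 20.0},
    {"order_id": 2, "region": "south", "amount": 15.0},
    {"order_id": 3, "region": "north", "amount": 25.0},
]

# Column-oriented layout: each attribute is stored together.
columns = to_columns(rows)

# An aggregate now touches a single column instead of every record.
total_amount = sum(columns["amount"])
```

Analytical queries that scan a few attributes over many records benefit from the columnar form, which is the usual motivation for columnar data warehouses; workloads that read or write whole records at a time tend to favour the row layout.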
Big Data Research, Volume 40, Article 100523
Citations: 0
Multi-dimensional feature learning for visible-infrared person re-identification
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-03-17; DOI: 10.1016/j.bdr.2025.100522
Zhenzhen Yang, Xinyi Wu, Yongpeng Yang
Visible-infrared person re-identification (VI-ReID) is a challenging task due to significant differences between modalities and feature representation of visible and infrared images. The primary goal of current VI-ReID is to reduce discrepancies between modalities. However, existing research primarily focuses on learning modality-invariant features. Due to significant modality differences, it is challenging to learn an effective common feature space. Moreover, the intra-modality differences have not been well addressed. Therefore, a novel multi-dimensional feature learning network (MFLNet) is proposed in this paper to tackle the inherent challenges of intra-modality and inter-modality differences in VI-ReID. Specifically, to effectively address intra-modality variations, we employ the random local shear (RLS) augmentation, which accurately simulates viewpoint and posture changes. This augmentation can be seamlessly incorporated into other methods without modifying the network or parameters. Additionally, we integrate the multi-dimensional information mining (MIM) module to extract discriminative features and bridge the gap between modalities. Moreover, the cyclical smoothing focal (CSF) loss is introduced to prioritize challenging samples during training, thereby enhancing the ReID performance. Finally, the experimental results indicate that the proposed MFLNet outperforms other VI-ReID approaches on the SYSU-MM01, RegDB and LLCM datasets.
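To show how a focal-style loss prioritizes challenging samples, the sketch below implements the standard focal loss for a single sample. It is not the paper's CSF loss, whose cyclical and smoothing terms are not described in the abstract; `gamma` is the usual focusing parameter, and the probabilities are made-up examples.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one sample.

    `p_true` is the predicted probability of the correct class.
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


easy = focal_loss(0.9)   # confidently correct sample
hard = focal_loss(0.3)   # poorly classified sample

# Relative weight of the hard sample, with and without focusing.
ratio_focal = hard / easy
ratio_ce = focal_loss(0.3, gamma=0.0) / focal_loss(0.9, gamma=0.0)
```

The factor `(1 - p_true) ** gamma` shrinks the loss of well-classified samples much more than that of misclassified ones, so `ratio_focal` is far larger than `ratio_ce` and the hard sample dominates the gradient.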
由于可见和红外图像的模态和特征表示存在显著差异,可见-红外人物再识别(VI-ReID)是一项具有挑战性的任务。当前VI-ReID的主要目标是减少模式之间的差异。然而,现有的研究主要集中在模态不变特征的学习上。由于存在显著的模态差异,学习有效的公共特征空间具有挑战性。此外,模态内的差异还没有得到很好的解决。因此,本文提出了一种新的多维特征学习网络(MFLNet)来解决VI-ReID中模态内和模态间差异的固有挑战。具体来说,为了有效地解决模态内的变化,我们采用了随机局部剪切(RLS)增强,它准确地模拟了视点和姿态的变化。这种增强可以无缝地集成到其他方法中,而无需修改网络或参数。此外,我们还集成了多维信息挖掘(MIM)模块来提取判别特征并弥合模式之间的差距。此外,引入周期性平滑焦点(CSF)损失,在训练过程中优先考虑具有挑战性的样本,从而提高ReID性能。最后,实验结果表明,在SYSU-MM01、RegDB和LLCM数据集上,所提出的MFLNet优于其他VI-ReID方法。
Big Data Research, Volume 40, Article 100522.
Citations: 0
Deep attention dynamic representation learning networks for recommender system review modeling
IF 3.5, Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE, Pub Date: 2025-03-15, DOI: 10.1016/j.bdr.2025.100521
Shivangi Gheewala, Shuxiang Xu, Soonja Yeom
Despite considerable research on using deep learning and textual reviews in recommender systems, improving system performance remains a contentious matter. This is primarily due to issues in learning user-item representations. One issue is the limited ability of networks to model dynamic user-item representations from reviews. In particular, sequence-to-sequence learning models are likely to lose the semantic knowledge of earlier review sequences as it is overridden by later ones. Another issue lies in effectively integrating global-level and topical-level representations to extract informative content and enhance user-item representations. Existing methods struggle to maintain contextual consistency during this integration, resulting in suboptimal representation learning, especially when attempting to capture finer details. To address these issues, we propose a novel recommendation model called Deep Attention Dynamic Representation Learning (DADRL). Specifically, we employ Latent Dirichlet Allocation and a dynamic modulator-based Long Short-Term Memory to extract topical and dynamic global representations. Then, we introduce an attentional fusion methodology to integrate these representations in a contextually consistent manner and construct informative attentional user-item representations. We feed these representations into a factorization machine layer to predict the final scores. Experimental results on Amazon categories, Yelp, and LibraryThing show that our model outperforms several state-of-the-art methods. We further examine the DADRL architecture under various conditions to provide insight into the model's components.
Big Data Research, Volume 40, Article 100521.
Citations: 0
Complex data in tourism analysis: A stochastic approach to price competition
IF 3.5, Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE, Pub Date: 2025-03-13, DOI: 10.1016/j.bdr.2025.100520
Giovanni Angelini, Michele Costa, Andrea Guizzardi
This study examines pricing strategies and decision-making processes in the hospitality industry by analyzing “ask” prices on online travel agencies (i.e., the rates at which hoteliers are willing to sell their rooms). We face the challenge of modeling a continuous flow of big data organized as “time series of time series,” where daily seasonality and advance bookings intersect. Our research combines insights from tourism, quantitative methods, and big data to improve pricing strategies, contributing to both theory and practice in revenue management. Focusing on Venice, we analyze price competition as a multivariate stochastic process using a Structural Vector Autoregressive (SVAR) approach, aligning with modern dynamic pricing algorithms.
The findings show that time-based pricing strategies, which adjust based on the day of arrival and booking, are more important than room features in setting hotel prices. We also find that price changes have a non-linear and decreasing effect as the booking date approaches. These insights suggest that hotels could create more advanced pricing strategies, and policymakers should consider these factors when addressing the challenges related to overtourism.
We study the complex competitive relationships among heterogeneous service providers with an approach applicable to any market where consumption is delayed relative to purchase time. However, we highlight that the quality and accessibility of information in the tourism sector are key aspects to be considered when using big data in this industry.
Big Data Research, Volume 40, Article 100520.
Citations: 0