Machine unlearning allows participants to remove their data from a trained machine learning model in order to preserve their privacy and security. However, the machine unlearning literature for generative models is rather limited. Prior work on image-to-image (I2I) generative models treats unlearning as minimizing the loss between Gaussian noise and the model's output on forget samples. We argue, however, that machine learning models perform fairly well on unseen data: a retrained model learns generalized representations of the data and therefore would not produce Gaussian noise on forget samples. In this paper, we instead posit that the unlearned model should treat forget samples as out-of-distribution (OOD) data, i.e., it should no longer recognize or encode the specific patterns found in the forget samples. To achieve this, we propose a framework that decouples the model parameters via gradient ascent. Our framework guarantees, with theoretical justification, that forget samples become OOD for the unlearned model. We also provide an (ϵ, δ)-unlearning guarantee for model updates with gradient ascent. The unlearned model is then fine-tuned on the remaining samples to maintain its performance. We further propose a data-poisoning attack as an auditing mechanism to verify that the unlearned model has effectively removed the influence of the forget samples, and we demonstrate that, even under sample-level unlearning, our approach prevents backdoor regeneration, validating its effectiveness. Extensive empirical evaluation on two large-scale datasets, ImageNet-1K and Places365, highlights the superiority of our approach. To demonstrate performance comparable to that of a retrained model, we also compare a simple AutoEncoder against various baselines on the CIFAR-10 dataset.
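The core procedure described above, gradient ascent on the forget samples followed by gradient descent fine-tuning on the retained samples, can be sketched in miniature. This is a toy illustration only: the one-parameter linear "model" and the function names (`unlearn_step`, `fine_tune_step`) are our own assumptions, standing in for the paper's actual I2I model and framework.

```python
import numpy as np

# Toy stand-in for a trained model: y = w * x with a single scalar weight w.
def loss(w, x, y):
    return np.mean((w * x - y) ** 2)

def grad(w, x, y):
    return np.mean(2 * (w * x - y) * x)

def unlearn_step(w, x_forget, y_forget, lr=0.1):
    # Gradient *ascent*: move the parameter to INCREASE the loss on forget
    # samples, pushing them toward out-of-distribution for the model.
    return w + lr * grad(w, x_forget, y_forget)

def fine_tune_step(w, x_retain, y_retain, lr=0.1):
    # Standard gradient descent on retained samples to restore utility.
    return w - lr * grad(w, x_retain, y_retain)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                      # ground-truth weight is 2.0
x_forget, y_forget = x[:20], y[:20]
x_retain, y_retain = x[20:], y[20:]

w = 1.8                          # a (nearly) trained starting weight
w_unlearned = unlearn_step(w, x_forget, y_forget)
# Ascent raises the forget-set loss; fine-tuning then recovers retain-set
# performance.
w_final = w_unlearned
for _ in range(50):
    w_final = fine_tune_step(w_final, x_retain, y_retain)
```

In a real I2I setting the scalar update would be replaced by parameter-wise updates to the generator, but the alternation (ascent on forget data, descent on retained data) is the same.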
