Pub Date: 2024-03-19 | DOI: 10.1007/s40745-024-00522-7
Mathew P. M. Ashlin, P. G. Sankaran, E. P. Sreedevi
Panel count data refers to the information collected in studies of recurrent events in which subjects are observed only at specific time points. If the study subjects are exposed to recurrent events of several types, we obtain panel count data with multiple modes of recurrence. In this article, we present a novel method based on generalized estimating equations for the regression analysis of panel count data with multiple modes of recurrence. A cause-specific proportional mean model is developed to analyze the effect of covariates on the underlying counting processes associated with the multiple modes of recurrence. We conduct a detailed investigation of the joint estimation of the baseline cumulative mean functions and the regression parameters. Simulation studies are carried out to evaluate the finite sample performance of the proposed estimators. The procedures are applied to two real data sets to demonstrate their practical utility.
{"title":"Semiparametric Regression Analysis of Panel Count Data with Multiple Modes of Recurrence","authors":"Mathew P. M. Ashlin, P. G. Sankaran, E. P. Sreedevi","doi":"10.1007/s40745-024-00522-7","DOIUrl":"10.1007/s40745-024-00522-7","url":null,"abstract":"<div><p>Panel count data refers to the information collected in studies focusing on recurrent events, where subjects are observed only at specific time points. If these study subjects are exposed to recurrent events of several types, we obtain panel count data with multiple modes of recurrence. In this article, we present a novel method based on generalized estimating equations for the regression analysis of panel count data exposed to multiple modes of recurrence. A cause specific proportional mean model is developed to analyze the effect of covariates on the underlying counting process due to multiple modes of recurrence. We conduct a detailed investigation on the joint estimation of baseline cumulative mean functions and regression parameters. Simulation studies are carried out to evaluate the finite sample performance of the proposed estimators. The procedures are applied to two real data sets, to demonstrate the practical utility.\u0000</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"571 - 590"},"PeriodicalIF":0.0,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140228641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-08 | DOI: 10.1007/s40745-024-00524-5
Asmita Deshmukh, Anjali Raut
In this research, we introduce an innovative automated resume screening approach that leverages advanced Natural Language Processing (NLP) technology, specifically the Bidirectional Encoder Representations from Transformers (BERT) language model by Google. Our methodology involved collecting 200 resumes from participants with their consent and obtaining ten job descriptions from glassdoor.com for testing. We extracted keywords from the resumes, identified skill sets, and ranked them to focus on crucial attributes. After removing stop words and punctuation, we selected the top keywords for analysis. To ensure data precision, we employed stemming and lemmatization to correct tense and meaning. Using the pre-trained BERT model and tokenizer, we generated feature vectors for job descriptions and resume keywords. Our key findings include the calculation of the highest similarity index for each resume, which enabled us to shortlist the most relevant candidates. Notably, the similarity index could reach up to 0.3, and the resume screening speed could reach one resume per second. The application of BERT-based NLP techniques significantly improved screening efficiency and accuracy, streamlining talent acquisition and providing valuable insights to HR personnel for informed decision-making. This study underscores the transformative potential of BERT in revolutionizing recruitment through scalable and powerful automated resume screening, demonstrating its efficacy in enhancing the precision and speed of candidate selection.
{"title":"Applying BERT-Based NLP for Automated Resume Screening and Candidate Ranking","authors":"Asmita Deshmukh, Anjali Raut","doi":"10.1007/s40745-024-00524-5","DOIUrl":"10.1007/s40745-024-00524-5","url":null,"abstract":"<div><p>In this research, we introduce an innovative automated resume screening approach that leverages advanced Natural Language Processing (NLP) technology, specifically the Bidirectional Encoder Representations from Transformers (BERT) language model by Google. Our methodology involved collecting 200 resumes from participants with their consent and obtaining ten job descriptions from glassdoor.com for testing. We extracted keywords from the resumes, identified skill sets, and ranked them to focus on crucial attributes. After removing stop words and punctuation, we selected top keywords for analysis. To ensure data precision, we employed stemming and lemmatization to correct tense and meaning. Using the preinstalled BERT model and tokenizer, we generated feature vectors for job descriptions and resume keywords. Our key findings include the calculation of the highest similarity index for each resume, which enabled us to shortlist the most relevant candidates. Notably, the similarity index could reach up to 0.3, and the resume screening speed could reach 1 resume per second. The application of BERT-based NLP techniques significantly improved screening efficiency and accuracy, streamlining talent acquisition and providing valuable insights to HR personnel for informed decision-making. This study underscores the transformative potential of BERT in revolutionizing recruitment through scalable and powerful automated resume screening, demonstrating its efficacy in enhancing the precision and speed of candidate selection.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"591 - 603"},"PeriodicalIF":0.0,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-27 | DOI: 10.1007/s40745-024-00514-7
Mohammed S. Kotb, Haidy A. Newer, Marwa M. Mohie El-Din
Recently, ranked set sampling schemes have become quite popular in reliability analysis and life-testing problems. Based on an ordered ranked set sample, the Bayesian estimators and credible intervals for the entropy of the Rayleigh model are studied and compared with the corresponding estimators based on simple random sampling. These Bayes estimators of the entropy are developed and computed under various loss functions, such as the squared error, linear-exponential, Al-Bayyati, and general entropy loss functions. A comparison of the various entropy estimates based on mean squared error is carried out. A real-life data set and a simulation study are used to illustrate our procedures.
{"title":"Bayesian Inference for the Entropy of the Rayleigh Model Based on Ordered Ranked Set Sampling","authors":"Mohammed S. Kotb, Haidy A. Newer, Marwa M. Mohie El-Din","doi":"10.1007/s40745-024-00514-7","DOIUrl":"10.1007/s40745-024-00514-7","url":null,"abstract":"<div><p>Recently, ranked set samples schemes have become quite popular in reliability analysis and life-testing problems. Based on ordered ranked set sample, the Bayesian estimators and credible intervals for the entropy of the Rayleigh model are studied and compared with the corresponding estimators based on simple random sampling. These Bayes estimators for entropy are developed and computed with various loss functions, such as square error, linear-exponential, Al-Bayyati, and general entropy loss functions. A comparison study for various estimates of entropy based on mean squared error is done. A real-life data set and simulation are applied to illustrate our procedures.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 4","pages":"1435 - 1458"},"PeriodicalIF":0.0,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140427345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-27 | DOI: 10.1007/s40745-024-00519-2
Mahdi Mollakazemiha, Ehsan Bahrami Samani
Traditionally, in cognitive modeling for binary decision-making tasks, stochastic differential equations, particularly a family of diffusion decision models, are applied. These models suffer from difficulties in parameter estimation and forecasting because the differential equations have no analytical solutions. In this paper, we introduce a joint latent variable model for binary decision-making tasks and reaction time outcomes. Additionally, accelerated failure time (AFT) models can be used for the analysis of reaction time to estimate the effects of covariates on the acceleration or deceleration of the survival time. A full likelihood-based approach is used to obtain maximum likelihood estimates of the parameters of the model. To illustrate the utility of the proposed models, a simulation study and real data are analyzed.
{"title":"A Joint Cognitive Latent Variable Model for Binary Decision-making Tasks and Reaction Time Outcomes","authors":"Mahdi Mollakazemiha, Ehsan Bahrami Samani","doi":"10.1007/s40745-024-00519-2","DOIUrl":"10.1007/s40745-024-00519-2","url":null,"abstract":"<div><p>Traditionally, in cognitive modeling for binary decision-making tasks, stochastic differential equations, particularly a family of diffusion decision models, are applied. These models suffer from difficulties in parameter estimation and forecasting due to the non-existence of analytical solutions for the differential equations. In this paper, we introduce a joint latent variable model for binary decision-making tasks and reaction time outcomes. Additionally, accelerated Failure Time models can be used for the analysis of reaction time to estimate the effects of covariates on acceleration/deceleration of the survival time. A full likelihood-based approach is used to obtain maximum likelihood estimates of the parameters of the model.To illustrate the utility of the proposed models, a simulation study and real data are analyzed.\u0000</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"499 - 516"},"PeriodicalIF":0.0,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140424420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-15 | DOI: 10.1007/s40745-024-00518-3
Changlin Wang, Zhixia Yang, Junyou Ye, Xue Yang, Manchen Ding
This paper proposes a supervised kernel-free quadratic surface regression method for feature selection (QSR-FS). The method finds a quadratic function for each class and incorporates it into the least squares loss function. An \(l_{2,1}\)-norm regularization term is introduced to obtain a sparse solution, and a feature weight vector is constructed from the coefficients of the quadratic functions in all classes to explain the importance of each feature. An alternating iteration algorithm is designed to solve the optimization problem of this model. The computational complexity of the algorithm is provided, and the iterative formula is reformulated to further accelerate computation. In the experimental part, feature selection and its downstream classification tasks are performed on eight datasets from different domains, and the experimental results are analyzed using relevant evaluation indices. Furthermore, feature selection interpretability and parameter sensitivity analyses are provided. The experimental results demonstrate the feasibility and effectiveness of our method.
{"title":"Supervised Feature Selection via Quadratic Surface Regression with (l_{2,1})-Norm Regularization","authors":"Changlin Wang, Zhixia Yang, Junyou Ye, Xue Yang, Manchen Ding","doi":"10.1007/s40745-024-00518-3","DOIUrl":"10.1007/s40745-024-00518-3","url":null,"abstract":"<div><p>This paper proposes a supervised kernel-free quadratic surface regression method for feature selection (QSR-FS). The method is to find a quadratic function in each class and incorporates it into the least squares loss function. The <span>(l_{2,1})</span>-norm regularization term is introduced to obtain a sparse solution, and a feature weight vector is constructed by the coefficients of the quadratic functions in all classes to explain the importance of each feature. An alternating iteration algorithm is designed to solve the optimization problem of this model. The computational complexity of the algorithm is provided, and the iterative formula is reformulated to further accelerate computation. In the experimental part, feature selection and its downstream classification tasks are performed on eight datasets from different domains, and the experimental results are analyzed by relevant evaluation index. Furthermore, feature selection interpretability and parameter sensitivity analysis are provided. The experimental results demonstrate the feasibility and effectiveness of our method.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 2","pages":"647 - 675"},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139836443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-15 | DOI: 10.1007/s40745-024-00516-5
Shahid Mohammad, Isabel Mendoza
This paper introduces a new family of distributions called the hyperbolic tangent (HT) family. The cumulative distribution function of this model is defined using the standard hyperbolic tangent function. The fundamental properties of the distribution are thoroughly examined and presented. Additionally, an inverse exponential distribution is employed as a sub-model within the HT family, and its properties are also derived. The parameters of the HT family are estimated using the maximum likelihood method, and the performance of these estimators is assessed using a simulation approach. To demonstrate the significance and flexibility of the newly introduced family of distributions, two real data sets are utilized. These data sets serve as practical examples that showcase the applicability and usefulness of the HT family in real-world scenarios. By introducing the HT family, exploring its properties, employing maximum likelihood estimation, and conducting simulations and real data analyses, this paper contributes to the advancement of statistical modeling and distribution theory.
{"title":"A New Hyperbolic Tangent Family of Distributions: Properties and Applications","authors":"Shahid Mohammad, Isabel Mendoza","doi":"10.1007/s40745-024-00516-5","DOIUrl":"10.1007/s40745-024-00516-5","url":null,"abstract":"<div><p>This paper introduces a new family of distributions called the hyperbolic tangent (HT) family. The cumulative distribution function of this model is defined using the standard hyperbolic tangent function. The fundamental properties of the distribution are thoroughly examined and presented. Additionally, an inverse exponential distribution is employed as a sub-model within the HT family, and its properties are also derived. The parameters of the HT family are estimated using the maximum likelihood method, and the performance of these estimators is assessed using a simulation approach. To demonstrate the significance and flexibility of the newly introduced family of distributions, two real data sets are utilized. These data sets serve as practical examples that showcase the applicability and usefulness of the HT family in real-world scenarios. By introducing the HT family, exploring its properties, employing the maximum likelihood estimation, and conducting simulations and real data analyses, this paper contributes to the advancement of statistical modeling and distribution theory.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"457 - 480"},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139835200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-14 | DOI: 10.1007/s40745-024-00517-4
Anupam Dutta
The main objective of this paper is to forecast the realized volatility (RV) of the Bitcoin futures (BTCF) market. To serve our purpose, we propose an augmented heterogeneous autoregressive (HAR) model that incorporates information on the time-varying jumps observed in BTCF returns. Specifically, we estimate the jump-induced volatility using the GARCH-jump process and then include this information in the HAR model. Both the in-sample and out-of-sample analyses show that jumps offer added information that is not captured by existing HAR models. In addition, a novel finding is that the jump-induced volatility offers incremental information relative to the Bitcoin implied volatility index. In sum, our results indicate that the HAR-RV process comprising the leverage effects and jump volatility predicts RV more precisely than the standard HAR-type models. These findings have important implications for cryptocurrency investors.
{"title":"Assessing the Risk of Bitcoin Futures Market: New Evidence","authors":"Anupam Dutta","doi":"10.1007/s40745-024-00517-4","DOIUrl":"10.1007/s40745-024-00517-4","url":null,"abstract":"<div><p>The main objective of this paper is to forecast the realized volatility (RV) of Bitcoin futures (BTCF) market. To serve our purpose, we propose an augmented heterogenous autoregressive (HAR) model to consider the information on time-varying jumps observed in BTCF returns. Specifically, we estimate the jump-induced volatility using the GARCH-jump process and then consider this information in the HAR model. Both the in-sample and out-of-sample analyses show that jumps offer added information which is not provided by the existing HAR models. In addition, a novel finding is that the jump-induced volatility offers incremental information relative to the Bitcoin implied volatility index. In sum, our results indicate that the HAR-RV process comprising the leverage effects and jump volatility would predict the RV more precisely compared to the standard HAR-type models. These findings have important implications to cryptocurrency investors.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"481 - 497"},"PeriodicalIF":0.0,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s40745-024-00517-4.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139778431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-13 | DOI: 10.1007/s40745-024-00515-6
Shamshad Ur Rasool, M. A. Lone, S. P. Ahmad
In this paper, we propose and investigate a novel approach for generating probability distributions, referred to as the SMP transformation technique. Using the SMP transformation technique, we develop a new model based on the Lomax distribution, known as the SMP Lomax (SMPL) distribution. Compared with the Sine Power Lomax, Power Length-Biased Weighted Lomax, Exponentiated Lomax, and Lomax distributions, the SMPL distribution has the desirable attribute of greater flexibility than these well-known existing models. Furthermore, the article examines various aspects of the SMPL distribution, including its statistical properties and the maximum likelihood estimation procedure for the parameters. An extensive simulation study is carried out to illustrate the behaviour of the MLEs in terms of mean squared errors. To evaluate the effectiveness and flexibility of the proposed distribution, two real-life data sets are employed, and it is observed that the SMPL distribution outperforms the base Lomax model as well as the other competing models according to the Akaike Information Criterion, the corrected Akaike Information Criterion, the Hannan–Quinn information criterion, and other goodness-of-fit measures.
{"title":"An Innovative Technique for Generating Probability Distributions: A Study on Lomax Distribution with Applications in Medical and Engineering Fields","authors":"Shamshad Ur Rasool, M. A. Lone, S. P. Ahmad","doi":"10.1007/s40745-024-00515-6","DOIUrl":"10.1007/s40745-024-00515-6","url":null,"abstract":"<div><p>In this paper, we propose and investigate a novel approach for generating the probability distributions. The novel method is known as the SMP transformation technique. By using the SMP Transformation technique, we have developed a new model of the Lomax distribution known as SMP Lomax (SMPL) distribution. The SMPL distribution, which is comparable to the Sine Power Lomax distribution, Power Length BiasedWeighted Lomax Distribution, Exponentiated Lomax and Lomax distribution have the desirable attribute of allowing the superiority and the flexibility over other well known existing models. Furthermore, the research article examines various aspects related to the SMPL , including the statistical properties along with the maximum likelihood estimation procedure to estimate the parameters. An extensive simulation study is carried out to illustrate the behaviour of MLEs on the basis of Mean Square Errors. To evaluate the effectiveness and flexibility of the proposed distribution, two real-life data sets are employed and it is observed that SMPL outperforms base model of Lomax distribution as well as other mentioned competing models based on Akaike Information Criterion, Akaike Information criterion Corrected, Hannan–Quinn information criterion and other goodness of fit measures.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"439 - 455"},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139782016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-30 | DOI: 10.1007/s40745-024-00513-8
Sher Chhetri, Hongwei Long, Cory Ball
In finance, various stochastic models have been used to describe price movements of financial instruments. Following the seminal work of Robert Merton, several jump-diffusion models have been proposed for option pricing and risk management. In this study, we augment the process related to the dynamics of log returns in the Black–Scholes model by incorporating alpha-stable Lévy motion with constant volatility. We employ the sample characteristic function approach to investigate parameter estimation for discretely observed stochastic differential equations driven by Lévy noises. Furthermore, we discuss the consistency and asymptotic properties of the proposed estimators and establish a Central Limit Theorem. To further demonstrate the validity of the estimators, we present simulation results for the model. The utility of the proposed model is demonstrated using the Dow Jones Industrial Average data, and all parameters involved in the model are estimated. In addition, we delve into the broader implications of our work, discussing the relevance of our methods to big data-driven research, particularly in the fields of financial data modeling and climate models. We also highlight the importance of optimization and data mining in these contexts, referencing key works in the field. This study thus contributes to the specific area of finance and, beyond it, to the wider scientific community engaged in data science research and analysis.
{"title":"Parameter Estimation for Geometric Lévy Processes with Constant Volatility","authors":"Sher Chhetri, Hongwei Long, Cory Ball","doi":"10.1007/s40745-024-00513-8","DOIUrl":"10.1007/s40745-024-00513-8","url":null,"abstract":"<div><p>In finance, various stochastic models have been used to describe price movements of financial instruments. Following the seminal work of Robert Merton, several jump-diffusion models have been proposed for option pricing and risk management. In this study, we augment the process related to the dynamics of log returns in the Black–Scholes model by incorporating alpha-stable Lévy motion with constant volatility. We employ the sample characteristic function approach to investigate parameter estimation for discretely observed stochastic differential equations driven by Lévy noises. Furthermore, we discuss the consistency and asymptotic properties of the proposed estimators and establish a Central Limit Theorem. To further demonstrate the validity of the estimators, we present simulation results for the model. The utility of the proposed model is demonstrated using the Dow Jones Industrial Average data, and all parameters involved in the model are estimated. In addition, we delved into the broader implications of our work, discussing the relevance of our methods to big data-driven research, particularly in the fields of financial data modeling and climate models. We also highlight the importance of optimization and data mining in these contexts, referencing key works in the field. This study thus contributes to the specific area of finance and beyond to the wider scientific community engaged in data science research and analysis.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 1","pages":"63 - 93"},"PeriodicalIF":0.0,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140485101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-01-30 | DOI: 10.1007/s40745-023-00508-x
Redouane Benabdallah Benarmas, Kadda Beghdad Bey
Deep learning prediction models have emerged as the most widely used for the development of intelligent transportation systems (ITS), and their success is strongly reliant on the volume and quality of training data. However, traffic datasets are often small due to the limitations of the resources used to collect and store traffic flow data. Data augmentation (DA) is a key method for increasing the size of the training dataset before applying a prediction model. In this paper, we demonstrate the effectiveness of data augmentation for predicting traffic speed using a deep generative model-based (DGM) approach. We empirically evaluate the ability of time series-appropriate architectures to improve traffic prediction through a Train-on-Synthetic, Test-on-Real (TSTR) process. A time series-based generative adversarial network model is used to transform an original road traffic dataset into a synthetic dataset to improve traffic prediction. Experiments were carried out using the 6th Beijing and PeMS datasets to show that the transformation improves the prediction model's accuracy using both parametric and non-parametric methods. The original datasets are compared with the generated ones using statistical analysis methods to measure the fidelity and behavior of the produced data.
{"title":"Improving Road Traffic Speed Prediction Using Data Augmentation: A Deep Generative Models-based Approach","authors":"Redouane Benabdallah Benarmas, Kadda Beghdad Bey","doi":"10.1007/s40745-023-00508-x","DOIUrl":"10.1007/s40745-023-00508-x","url":null,"abstract":"<div><p>Deep learning prediction models have emerged as the most widely used for the development of intelligent transportation systems (ITS), and their success is strongly reliant on the volume and quality of training data. However, traffic datasets are often small due to the limitations of the resources used to collect and store traffic flow data. Data Augmentation (DA) is a key method to improve the amount of the training dataset before applying a prediction model. In this paper, we demonstrate the effectiveness of data augmentation for predicting traffic speed by using a Deep Generative Model-based approach (DGM). We empirically evaluate the ability of time series-appropriate architectures to improve traffic prediction over a Train on Synthetic Test on Real(TSTR) process. A Time Series-based Generative Adversarial Network model is used to transform an original road traffic dataset into a synthetic dataset to improve traffic prediction. Experiments were carried out using the 6th Beijing and PeMS datasets to show that the transformation improves the prediction model’s accuracy using both parametric and non-parametric methods. Original datasets are compared with the generated ones using statistical analysis methods to measure the fidelity and behavior of the produced data.\u0000</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 6","pages":"2199 - 2216"},"PeriodicalIF":0.0,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140485444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}