We live in an age where everything around us is being created. Data generation rates are alarmingly high, creating pressure to implement costly and straightforward data storage and recovery processes. The MapReduce model is used to build parallel, distributed algorithms that process large datasets on a cluster. The MapReduce strategy from Hadoop helps develop a community of non-commercial use to offer a new algorithm for resolving such problems for commercial applications as expected from this working algorithm with insights as a result of disproportionate or discriminatory Hadoop cluster results. Expected results are obtained in the work and the exam conducted under this job; many of them are scheduled to set schedules, match matrices' data positions, clustering before determining to click, and accurate mapping and internal reliability to be closed together to avoid running and execution times. Mapper output and proponents have been implemented, and the map has been used to reduce the function. The execution input key/value pair and output key/value pair have been set. This paper focuses on evaluating this technique for the efficient retrieval of large volumes of data. The technique allows for capabilities to inform a massive database of information, from storage and indexing techniques to the distribution of queries, scalability, and performance in heterogeneous environments. The results show that the proposed work reduces the data processing time by 30%.
{"title":"Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data","authors":"Ankit Kumar;Neeraj Varshney;Surbhi Bhatiya;Kamred Udham Singh","doi":"10.26599/BDMA.2022.9020026","DOIUrl":"10.26599/BDMA.2022.9020026","url":null,"PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"465-477"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233249.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49356278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
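The input and output key/value pairs mentioned in the MapReduce abstract above can be illustrated with a minimal, single-process sketch. This is not the paper's algorithm and does not use the Hadoop API; the word-count job and the names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions chosen only to show how a mapper emits (key, value) pairs, how they are grouped by key, and how a reducer folds each group.

```python
from collections import defaultdict


def word_count_mapper(offset, line):
    """Input pair: (line offset, line text). Output pairs: (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    """Input pair: (word, list of counts). Output value: total count."""
    return sum(counts)


def map_phase(records, mapper):
    """Apply the mapper to every input record and stream its output pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    return reduce_phase(shuffle(map_phase(records, word_count_mapper)), sum_reducer)


if __name__ == "__main__":
    lines = [(0, "big data big cluster"), (1, "data cluster data")]
    print(run_job(lines))  # {'big': 2, 'data': 3, 'cluster': 2}
```

On a real cluster the three phases run on different nodes and the shuffle moves data over the network, which is where scheduling and data placement affect execution time.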
Pub Date : 2023-08-29 DOI: 10.26599/BDMA.2022.9020052
Ankit Kumar;Kamred Udham Singh;Manish Kumar
The correct diagnosis of heart disease can save lives, while an incorrect diagnosis can be lethal. Using the UCI machine learning heart disease dataset, we compare the results and analyses of various machine learning approaches, including deep learning. We used a dataset with 13 primary characteristics to carry out the research. Support vector machine and logistic regression algorithms are used to process the datasets, and the latter displays the highest accuracy in predicting coronary disease. Python programming is used to process the datasets. Multiple research initiatives have used machine learning to speed up the healthcare sector. We also used conventional machine learning approaches in our investigation to uncover the links between the numerous features available in the dataset and then used them effectively to anticipate heart disease risks. Evaluation using accuracy and the confusion matrix has shown some favorable outcomes. The dataset contains certain unnecessary features; to get the best results, these are dealt with using isolation logistic regression and Support Vector Machine (SVM) classification.
{"title":"A Clinical Data Analysis Based Diagnostic Systems for Heart Disease Prediction Using Ensemble Method","authors":"Ankit Kumar;Kamred Udham Singh;Manish Kumar","doi":"10.26599/BDMA.2022.9020052","DOIUrl":"10.26599/BDMA.2022.9020052","url":null,"PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"513-525"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233243.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42487577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
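The heart disease abstract above evaluates its classifiers with accuracy and a confusion matrix. As a rough illustration of those two metrics for a binary label (1 = disease, 0 = no disease), the sketch below computes them from scratch. The function names and the toy labels are assumptions; they are not data or code from the paper.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts


def accuracy(cm):
    """Fraction of predictions that match the true label."""
    total = cm["tp"] + cm["fp"] + cm["tn"] + cm["fn"]
    return (cm["tp"] + cm["tn"]) / total


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)            # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
    print(accuracy(cm))  # 0.75
```

In a diagnostic setting the false-negative count usually deserves separate attention, since accuracy alone hides how many patients with the disease were missed.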
{"title":"Call for Papers: Special Issue on Edge AI Empowered Giant Model Training","authors":"","doi":"","DOIUrl":"https://doi.org/","url":null,"abstract":"","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"526-526"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233251.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68007722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-29 DOI: 10.26599/BDMA.2022.9020041
Juli Yin;Linfeng Wei;Zhiquan Liu;Xi Yang;Hongliang Sun;Yudan Cheng;Jianbin Mai
With the rapid development of mobile devices, aggregation security and efficiency are more important than in the past in crowd sensing. When collecting large-scale vehicle-provided data, the data transmitted via autonomous networks are publicly accessible to all attackers, which increases the risk of vehicle exposure, so we need to ensure data aggregation security. In addition, low aggregation efficiency will lead to insufficient sensing data, making the data unable to support data mining services. Aiming at the problem of aggregation security and efficiency in large-scale data collection, this article proposes a data collection mechanism (VDCM) for crowd sensing in vehicular ad hoc networks (VANETs). The mechanism includes two sub-mechanisms and selects the appropriate one to reduce consumption. It selects sub-mechanism 1 when there are very few vehicles or the coalition cannot be formed, and otherwise selects sub-mechanism 2. Single aggregation is used to collect data in sub-mechanism 1. In sub-mechanism 2, cooperative vehicles are selected by using a coalition formation strategy and an auction cooperation agreement, and multi-aggregation is used to collect data. Both sub-mechanisms use Paillier homomorphic encryption to ensure the security of data aggregation. In addition, the mechanism adds data update and scoring steps to increase the amount of available data. The performance analysis shows that the mechanism proposed in this paper can safely aggregate data and reduce consumption. The simulation results indicate that the proposed mechanism reduces time consumption and increases the amount of available data compared with existing mechanisms.
{"title":"VDCM: A Data Collection Mechanism for Crowd Sensing in Vehicular Ad Hoc Networks","authors":"Juli Yin;Linfeng Wei;Zhiquan Liu;Xi Yang;Hongliang Sun;Yudan Cheng;Jianbin Mai","doi":"10.26599/BDMA.2022.9020041","DOIUrl":"10.26599/BDMA.2022.9020041","url":null,"PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"391-403"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233240.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48342786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
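The VDCM abstract above relies on Paillier homomorphic encryption for secure aggregation. The property that makes this work is that multiplying Paillier ciphertexts yields an encryption of the sum of the plaintexts, so an aggregator can add readings without decrypting any of them. The toy sketch below demonstrates only that property. It is an assumption-laden illustration, not the paper's protocol: the primes are tiny and insecure, the generator is fixed to g = n + 1, and the function names are made up for this example.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt 0 <= m < n as (1 + m*n) * r^n mod n^2 with random r."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n


def aggregate(n, ciphertexts):
    """Multiply ciphertexts; the product decrypts to the sum of plaintexts."""
    acc = 1
    for c in ciphertexts:
        acc = acc * c % (n * n)
    return acc


if __name__ == "__main__":
    n, private_key = keygen(61, 53)       # n = 3233, far too small for real use
    rng = random.Random(0)
    readings = [120, 75, 210]             # hypothetical vehicle readings
    ciphertexts = [encrypt(n, m, rng) for m in readings]
    total = decrypt(n, private_key, aggregate(n, ciphertexts))
    print(total)  # 405
```

The sum must stay below n for the result to be meaningful, which is one reason real deployments use moduli of 2048 bits or more.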
Every real-world scenario is now digitally replicated in order to reduce paperwork and human labor costs. Machine Learning (ML) models are also being used to make predictions in these applications. Accurate forecasting requires knowledge of these machine learning models and their distinguishing features. The datasets we use as input for each of these different types of ML models yield different results, so the choice of an ML model for a dataset is critical. A loan risk model is used to show how ML models for a dataset can be linked together. The purpose of this study is to look into how we could use machine learning to quantify or forecast mortgage credit risk. This phrase refers to the process of evaluating massive amounts of data in order to derive useful information for making decisions in a variety of fields. If credit risk is considered, a method based on an examination of what caused and how mortgage credit risk affected credit defaults during the still-current economic crisis of 2021 will be tried. Various approaches to credit risk calculation will be examined, ranging from the most basic to the most complex. In addition, we will conduct a case study on a sample of mortgage loans and compare the results of three different analytical approaches (logistic regression, decision tree, and gradient boosting) to see which one produces the most commercially useful insights.
{"title":"AI-Based Hybrid Models for Predicting Loan Risk in the Banking Sector","authors":"Vikas Kumar;Shaiku Shahida Saheb;Preeti;Atif Ghayas;Sunil Kumari;Jai Kishan Chandel;Saroj Kumar Pandey;Santosh Kumar","doi":"10.26599/BDMA.2022.9020037","DOIUrl":"10.26599/BDMA.2022.9020037","url":null,"PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"478-490"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233246.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43857463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-29 DOI: 10.26599/BDMA.2022.9020043
Xiaodong Qu;Chengcheng Guan;Gang Xie;Zhiyi Tian;Keshav Sood;Chaoli Sun;Lei Cui
Accurate load forecasting is critical for electricity production, transmission, and maintenance. Deep learning (DL) models have replaced classical models as the most popular prediction models. However, a deep prediction model requires users to provide a large amount of private electricity consumption data, which poses potential privacy risks. Using federated learning (FL), edge nodes can collaboratively train a global model through aggregation. As a novel distributed machine learning (ML) technique, FL only exchanges model parameters without sharing raw data. However, existing forecasting methods based on FL still face challenges from data heterogeneity and privacy disclosure. Accordingly, we propose a user-level load forecasting system based on personalized federated learning (PFL) to address these issues. The obtained personalized model outperforms the global model on local data. Further, we introduce a novel differential privacy (DP) algorithm in the proposed system to provide an additional privacy guarantee. Based on the principle of the generative adversarial network (GAN), the algorithm achieves a balance between privacy and prediction accuracy throughout the game. We perform simulation experiments on a real-world dataset, and the experimental results show that the proposed system can meet the requirements for accuracy and privacy in real load forecasting scenarios.
{"title":"Personalized Federated Learning for Heterogeneous Residential Load Forecasting","authors":"Xiaodong Qu;Chengcheng Guan;Gang Xie;Zhiyi Tian;Keshav Sood;Chaoli Sun;Lei Cui","doi":"10.26599/BDMA.2022.9020043","DOIUrl":"10.26599/BDMA.2022.9020043","url":null,"PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"421-432"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233242.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48886250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
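The load forecasting abstract above notes that federated learning exchanges only model parameters and builds a global model through aggregation. A common way to do that aggregation is a sample-weighted average of the clients' parameter vectors (often called federated averaging). The sketch below shows that step only. It is an assumption for illustration and is neither the paper's personalized FL scheme nor its differential privacy algorithm; the function name and the toy numbers are hypothetical.

```python
def federated_average(client_updates):
    """Weighted average of client parameter vectors.

    client_updates: list of (parameters, num_samples) pairs, where
    parameters is a list of floats of equal length for every client.
    """
    if not client_updates:
        raise ValueError("at least one client update is required")
    dim = len(client_updates[0][0])
    if any(len(params) != dim for params, _ in client_updates):
        raise ValueError("all clients must send vectors of the same length")
    total_samples = sum(n for _, n in client_updates)
    return [
        sum(params[i] * n for params, n in client_updates) / total_samples
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical households: one with 1 sample, one with 3 samples.
    updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]
    print(federated_average(updates))  # [2.5, 3.5]
```

Because the server sees only parameter vectors, raw consumption data never leaves the client, although parameters can still leak information, which is the gap differential privacy is meant to close.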
Pub Date : 2023-08-29 DOI: 10.26599/BDMA.2022.9020050
Mengmeng Yang;Longxia Huang;Chenghua Tang
With the development of information technology, massive amounts of data are generated every day. Collecting and analysing these data helps service providers improve their services and gain an advantage in fierce market competition. K-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users' data, which can disclose users' private information. Local differential privacy has attracted considerable attention recently due to its strong privacy guarantee and has been applied to clustering analysis. However, existing $K$