A Comparative Analysis of Apache Spark Dataframes over Resilient Distributed Datasets (RDDs)
Author: Ashima Sahni
Journal: International Journal of Scientific Research in Engineering and Management
Publication date: 2024-07-18
Publication type: Journal Article
DOI: 10.55041/ijsrem36566 (https://doi.org/10.55041/ijsrem36566)
Citations: 0
Abstract
Apache Spark is now widely used to process very large datasets because of its flexibility, scalability, robustness, speed, and support for multiple programming languages such as Java, Scala, and Python. In each of these languages it offers several programming abstractions, including DataFrames and Resilient Distributed Datasets (RDDs). This paper gives an in-depth overview of how Spark DataFrames can improve performance compared with Spark RDDs and SQL-based data processing. DataFrames are widely used to manage and process large datasets, whether structured, semi-structured, or unstructured. Using practical examples and use cases, the paper describes how to use DataFrames in place of the traditional RDD approach to improve performance in large-scale data processing. It highlights the main reasons to prefer DataFrames in any application that must turn a large dataset into meaningful output.