Development of Data Evaluation Benchmark for Data Wrangling Recommendation System
Yuqing Wang, Anna Fariha
arXiv - CS - Databases, 2024-09-16
DOI: arxiv-2409.10635 (https://doi.org/arxiv-2409.10635)
Abstract
CoWrangler is a data-wrangling recommender system designed to streamline data
processing tasks. Because data processing is often time-consuming and
complex for novice users, we aim to simplify the decision of which data
operation to apply next. By analyzing over
10,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive
insights into common data processing strategies employed by users across
various tasks. This analysis helps us understand how dataset quality influences
wrangling operations and informs our ongoing efforts, which may include
expanding our dataset sources in the future.