Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann
{"title":"Aggregation Detection in CSV Files","authors":"Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann","doi":"10.48786/edbt.2022.10","DOIUrl":null,"url":null,"abstract":"Aggregations are an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize data therein. Identifying aggregations in tables can help understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner with aggregations appearing in arbitrary positions and displaying rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules to allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach with two manually annotated datasets, showing that AggreCol is capable of achieving 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, proving the generalizability of our proposed techniques.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:207-2:219"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in database technology : proceedings. International Conference on Extending Database Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48786/edbt.2022.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Aggregations are an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize data therein. Identifying aggregations in tables can help understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner with aggregations appearing in arbitrary positions and displaying rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules to allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach with two manually annotated datasets, showing that AggreCol is capable of achieving 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, proving the generalizability of our proposed techniques.