The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery

arXiv - CS - Databases. Pub Date: 2024-08-18. DOI: arxiv-2408.09506

Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper


Citations: 0

Abstract



Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, access to the underlying dataset behind a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, focusing on the use of line charts as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called Fine-grained Cross-modal Relevance Learning Model (FCM), which aims to estimate the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders are adopted to learn representations for a line chart and a dataset, preserving fine-grained information, followed by a cross-modal matcher to match the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Last, we propose a benchmark tailored for this problem since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our proposed approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
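The abstract reports its results as prec@50 and ndcg@50 but does not define them. The sketch below assumes the standard information-retrieval definitions of precision@k and NDCG@k, applied to a ranked list of candidate datasets with graded or binary relevance labels. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k ranked candidates with a positive relevance label."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so the 1-based rank is i + 1 and the discount is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking.

    Assumption: `relevance` holds labels for every candidate in the repository,
    in ranked order, so that sorting it yields the true ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Toy ranking of five candidate datasets for one line chart query
# (1 = relevant, 0 = not relevant); hypothetical labels for illustration only.
ranked_labels = [1, 0, 1, 1, 0]
p_at_3 = precision_at_k(ranked_labels, 3)
n_at_3 = ndcg_at_k(ranked_labels, 3)
```

Precision@k only counts how many relevant datasets reach the top k, while NDCG@k also rewards placing them nearer the top, which is why the two metrics are usually reported together.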

