The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery

Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper
arXiv - CS - Databases · Published 2024-08-18 · DOI: arxiv-2408.09506 (https://doi.org/arxiv-2408.09506)
Citations: 0

Abstract

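Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, the dataset underlying a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts: using a line chart as a query to discover datasets in a large data repository that are capable of generating similar line charts. To solve this problem, we propose the Fine-grained Cross-modal Relevance Learning Model (FCM), which estimates the relevance between a line chart and a candidate dataset. FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Two novel segment-level encoders then learn representations for the line chart and the dataset while preserving fine-grained information, and a cross-modal matcher matches the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated from data aggregation. Last, we propose a benchmark tailored to this problem, since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of the proposed method: our approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.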
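The two reported metrics, prec@50 and ndcg@50, are standard ranking measures evaluated over the top 50 retrieved datasets. The abstract does not spell out the exact definitions used, so the following is only a minimal sketch of the common binary-relevance formulation. The function names, the dataset identifiers, and the choice to divide precision by k even when fewer than k results are returned are all assumptions made for illustration, not the authors' evaluation code.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    if k <= 0:
        raise ValueError("k must be positive")
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking of candidate datasets for one line chart query.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
p_at_4 = precision_at_k(ranked_ids, relevant_ids, k=4)
n_at_4 = ndcg_at_k(ranked_ids, relevant_ids, k=4)
```

With k set to 50 and the scores averaged over all chart queries in a benchmark, these functions would yield figures comparable in kind to the prec@50 and ndcg@50 values quoted above, assuming binary relevance labels.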