Accelerating Aggregation Queries on Unstructured Streams of Data

Proc. VLDB Endow. Pub Date: 2023-07-01. DOI: 10.14778/3611479.3611496

Matthew Russo, Tatsunori B. Hashimoto, Daniel Kang, Yi Sun, M. Zaharia

Citations: 0

Abstract

Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited to the streaming setting, because it either requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams.

In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real time. We analyze InQuest theoretically and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch-setting algorithm.
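To make the proxy-plus-oracle idea concrete, the following is a minimal sketch of one generic way to combine the two: stratify records by proxy score, keep a small uniform sample per stratum with reservoir sampling, and run the oracle only on the sampled records. It is not the authors' InQuest algorithm. The function name `stratified_stream_mean`, the fixed `boundaries`, and the fixed `budget_per_stratum` are all assumptions made for illustration, and the abstract does not say how InQuest chooses strata or allocates its budget.

```python
import bisect
import random
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def stratified_stream_mean(
    segment: Iterable[T],
    proxy: Callable[[T], float],
    oracle: Callable[[T], float],
    boundaries: Sequence[float],
    budget_per_stratum: Sequence[int],
    rng: random.Random,
) -> Tuple[float, int]:
    """Estimate the mean oracle value over one stream segment.

    Hypothetical helper: strata are fixed proxy-score intervals split at
    `boundaries`, and each stratum has a fixed oracle budget.
    Returns (estimate, number_of_oracle_calls).
    """
    k = len(boundaries) + 1
    if len(budget_per_stratum) != k:
        raise ValueError("need one budget entry per stratum")
    counts = [0] * k
    reservoirs: List[List[T]] = [[] for _ in range(k)]

    for record in segment:
        # The cheap proxy runs on every record and picks the stratum.
        s = bisect.bisect_right(boundaries, proxy(record))
        counts[s] += 1
        # Reservoir sampling keeps a uniform sample without knowing
        # the stratum size in advance.
        if len(reservoirs[s]) < budget_per_stratum[s]:
            reservoirs[s].append(record)
        else:
            j = rng.randrange(counts[s])
            if j < budget_per_stratum[s]:
                reservoirs[s][j] = record

    total = sum(counts)
    if total == 0:
        return 0.0, 0

    estimate = 0.0
    oracle_calls = 0
    for s in range(k):
        if not reservoirs[s]:
            continue  # empty stratum, or zero budget (its weight is dropped)
        # The expensive oracle runs only on the sampled records.
        labels = [oracle(r) for r in reservoirs[s]]
        oracle_calls += len(labels)
        # Weight each stratum mean by its share of the segment.
        estimate += (counts[s] / total) * (sum(labels) / len(labels))
    return estimate, oracle_calls
```

In the traffic-camera example, `proxy` might be a small model that scores how likely a frame is to contain cars, and `oracle` a full object detector that returns the car count. Note one simplification: this sketch buffers sampled records and calls the oracle at the end of the segment, so a real-time system would need to bound segment length or call the oracle as records enter the reservoir.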
