Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Pub Date : 2012-08-01 DOI:10.1145/2339530.2339576

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Volume 2012, pages 262-270. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6816304/pdf/


Abstract

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower-powered devices than are currently possible.
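To make the problem in the abstract concrete, the following is a minimal sketch of exact subsequence similarity search under DTW, written as an unoptimized brute-force baseline. It is not the paper's method and does not implement any of the four ideas the abstract refers to. The function names `dtw_distance` and `brute_force_search`, the squared-difference cost, and the warping-window parameter `window` (a Sakoe-Chiba band) are assumptions chosen for illustration.

```python
import math


def dtw_distance(a, b, window):
    """DTW distance between sequences a and b (illustrative helper).

    Assumes a squared-difference point cost and a warping band of
    half-width `window`; window=0 reduces to Euclidean distance
    for equal-length inputs.
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # band must be wide enough to reach (n, m)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor cells.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def brute_force_search(series, query, window):
    """Return (start_index, distance) of the best-matching subsequence.

    Compares the query against every subsequence of the same length,
    which is the cost that makes naive search a bottleneck at scale.
    """
    best_index, best_dist = -1, math.inf
    for start in range(len(series) - len(query) + 1):
        candidate = series[start:start + len(query)]
        dist = dtw_distance(candidate, query, window)
        if dist < best_dist:
            best_index, best_dist = start, dist
    return best_index, best_dist


if __name__ == "__main__":
    series = [0, 0, 1, 2, 3, 2, 1, 0, 0]
    query = [1, 2, 3, 2, 1]
    print(brute_force_search(series, query, window=1))
```

Each candidate costs on the order of the query length times the band width, and there is one candidate per position in the series, which illustrates why the abstract treats similarity search as the bottleneck for larger mining tasks.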


