Electrocardiogram (ECG) data is commonly used in clinical practice to reveal the instantaneous status of cardiac electrophysiology, and is related to numerous heart diseases. Efficient similarity search on ECG data can assist diagnosis. However, similarity search on ECG data differs from similarity search on images: ECG data is a kind of physiological wave data, and there are no established, robust feature extraction methods for such data. Thus, we adopt a supervised framework that preserves locality based on label information while extracting effective features automatically. Experiments on real-life data show the effectiveness and efficiency of the proposed approach, FASE.
Meng Wu, Lei Li, Hongyan Li. "FASE: Feature-Based Similarity Search on ECG Data." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00044
Zhen Wang, Maohong Fan, S. Muknahallipatna, Chao Lan
This paper considers anomaly detection with multi-view data. Unlike traditional detection on single-view data, which identifies anomalies based on inconsistency between instances, multi-view anomaly detection identifies anomalies based on view inconsistency within each instance. Current multi-view detection approaches are mostly unsupervised and transductive, which may limit performance in the many applications that have labeled normal data and require efficient detection on new data. In this paper, we propose an inductive semi-supervised multi-view anomaly detection approach. We design a probabilistic generative model for normal data, which assumes the different views of a normal instance are generated from a shared latent factor, conditioned on which the views become independent. We estimate the model by maximizing its likelihood on normal data using the EM algorithm. Then, we apply the model to detect anomalies, which are instances generated with small probabilities. We evaluate our approach on nine public data sets under different multi-view anomaly settings, and show that it outperforms several state-of-the-art multi-view detection methods.
"Inductive Multi-view Semi-Supervised Anomaly Detection via Probabilistic Modeling." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00042
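The detection rule in the abstract above — fit a generative model to normal data, then flag instances it assigns low probability — can be illustrated with a plain multivariate Gaussian standing in for the paper's shared-latent-factor model (a simplifying assumption made here purely for illustration; the actual model conditions the views on a latent factor and is fit with EM):

```python
import numpy as np

def fit_gaussian(X):
    """Fit a multivariate Gaussian to normal (non-anomalous) training data."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
    return mu, cov

def log_density(X, mu, cov):
    """Per-instance Gaussian log-density; low values indicate anomalies."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))
```

Inductive use means scoring previously unseen instances against the model fit on normal data only, declaring an instance anomalous when its log-density falls below a threshold chosen on held-out normal data.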
Co-location pattern mining is one of the hot issues in spatial pattern mining. Similarity measures between co-location patterns can be used to solve problems such as pattern compression, pattern summarization, pattern selection, and pattern ordering. Many researchers have recently focused on this issue and used such measures to provide more concise sets of co-location patterns. Unfortunately, existing measures suffer from various weaknesses: some can only calculate the similarity between a super-pattern and its sub-patterns, while others require additional domain knowledge. In this paper, we propose a general similarity measure for any two co-location patterns. First, we study the characteristics of co-location patterns and present a novel representation model based on maximal cliques. Then, two materializations of the maximal cliques and the pattern relationship, a 0-1 vector and a key-value vector, are proposed and discussed. Based on these materialization methods, the similarity measure, Vector-Degree, is defined by applying cosine similarity. Finally, the similarity measure is used to group patterns with a hierarchical clustering algorithm. Experimental results on both synthetic and real-world data sets show the efficiency and effectiveness of the proposed method.
Pingping Wu, Lizhen Wang, Muquan Zou. "Vector-Degree: A General Similarity Measure for Co-location Patterns." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00045
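The final step of the pipeline above — cosine similarity between materialized pattern vectors — can be sketched as follows. The two 0-1 vectors are hypothetical examples (entry k marks whether the pattern participates in maximal clique k); constructing such vectors for real co-location patterns follows the paper's materialization methods:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(u @ v) / (nu * nv)

# Hypothetical 0-1 materializations over five maximal cliques.
p1 = np.array([1, 1, 0, 1, 0], dtype=float)
p2 = np.array([1, 0, 0, 1, 1], dtype=float)
vector_degree = cosine(p1, p2)  # shares cliques 0 and 3 -> 2/3
```

Patterns sharing more maximal cliques score closer to 1, which is what lets a hierarchical clustering algorithm group similar patterns together.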
E. Hunt, Binay Dahal, J. Zhan, L. Gewali, Paul Y. Oh, Ritvik Janamsetty, Chanana Kinares, Chanel Koh, Alexis Sanchez, Felix Zhan, Murat Özdemir, Shabnam Waseem, Osman Yolcu
Paraphrase Identification, or Natural Language Sentence Matching (NLSM), is one of the important and challenging tasks in Natural Language Processing: given a pair of sentences, the task is to identify whether one sentence is a paraphrase of the other. A paraphrase of a sentence conveys the same meaning, but its structure and word order vary. The task is challenging because it is difficult to infer the proper context of a sentence given its short length, and devising similarity metrics over the inferred contexts of a sentence pair is not straightforward either. Its applications, however, are numerous. This work explores various machine learning algorithms to model the task and applies different input encoding schemes. Specifically, we create models using Logistic Regression, Support Vector Machines, and different architectures of Neural Networks. Among the compared models, as expected, the Recurrent Neural Network (RNN) is best suited to our paraphrase identification task. We also propose plagiarism detection as one area where paraphrase identification can be effectively applied.
"Machine Learning Models for Paraphrase Identification and its Applications on Plagiarism Detection." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00021
Xilun Chen, L. Mathesen, Giulia Pedrielli, K. Candan
Knowledge discovery and decision making through data- and model-driven computer simulation ensembles are increasingly critical in many application domains. However, these simulation ensembles are expensive to obtain. Consequently, given a relatively small simulation budget, one needs to identify a sparse ensemble that includes the most informative simulations to support effective exploration of the input parameter space. In this paper, we propose complicacy-guided parameter space sampling (CPSS) for knowledge discovery with limited simulation budgets, which relies on a novel complicacy-driven guidance mechanism to rank candidate models and a novel rank-stability-based parameter space partitioning strategy to identify the simulation instances to execute. The advantage of the proposed approach is that, unlike purely fit-based approaches, it avoids extensive simulations in difficult-to-fit regions of the parameter space when a region can be explained by a much simpler model requiring fewer simulation samples, even at a slightly lower fit.
"Complicacy-Guided Parameter Space Sampling for Knowledge Discovery with Limited Simulation Budgets." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00015
Eryn Aguilar, Benjamin Lowe, J. Zhan, L. Gewali, Paul Y. Oh, Jevis Dancel, Deysaree Mamaud, Dorothy Pirosch, Farin Tavacoli, Felix Zhan, Robbie Pearce, Margaret Novack, Hokunani Keehu
Security is a universal concern across a multitude of sectors involved in the transfer and storage of computerized data. In the realm of cryptography, random number generators (RNGs) are integral to the creation of encryption keys that protect private data, and the production of uniform probability outcomes is a revenue source for certain enterprises (most notably the casino industry). Arbitrary thread schedule reconstruction of compare-and-swap operations is used to generate input traces for the Blum-Elias algorithm as a method for constructing random sequences, provided the compare-and-swap operations avoid cache locality. Thread access to shared memory at the memory controller is a true random source that can be polled indirectly through our algorithm with unlimited parallelism. A theoretical and experimental analysis of the observation and reconstruction algorithm is presented. The quality of the random number generator is experimentally analyzed using two standard test suites, DieHarder and ENT, on three data sets.
"Highly Parallel Seedless Random Number Generation from Arbitrary Thread Schedule Reconstruction." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00009
Hongchang Wu, Ziyu Guan, Tao Zhi, Wei Zhao, Cai Xu, Hong Han, Yaming Yang
Existing cross-modal retrieval methods are mainly constrained to the bimodal case. When applied to the multi-modal case, we need to train O(K^2) (K: the number of modalities) separate models, which is inefficient and unable to exploit common information among multiple modalities. Though some studies have focused on learning a common space of multiple modalities for retrieval, they assume data to be i.i.d. and fail to learn the underlying semantic structure, which can be important for retrieval. To tackle this issue, we propose an Adversarial Graph Attention Network for multi-modal cross-modal retrieval (AGAT). AGAT synthesizes a self-attention network (SAT), a graph attention network (GAT), and a multi-modal generative adversarial network (MGAN). The SAT generates high-level embeddings for data items from different modalities, with self-attention capturing feature-level correlations in each modality. The GAT then uses attention to aggregate embeddings of matched items from different modalities to build a common embedding space. The MGAN aims to "cluster" matched embeddings of different modalities in the common space by forcing them to be similar to the aggregation. Finally, we train the common space so that it captures the semantic structure by constraining within-class/between-class distances. Experiments on three datasets show the effectiveness of AGAT.
"Adversarial Graph Attention Network for Multi-modal Cross-Modal Retrieval." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00043
Recently, patent data analysis has attracted much attention, and patent keyword extraction is a hot problem. Most existing methods for patent keyword extraction are based on word frequency and ignore semantic information. In this paper, we propose an Unsupervised Keyword Extraction Method (UKEM) based on Chinese patent clustering. More specifically, we use a Skip-gram model to train word embeddings on a Chinese patent corpus. Each patent is then represented as a vector, called a patent vector, and these patent vectors are clustered to obtain several cluster centroids. Next, the distance between each word vector in a patent abstract and the cluster centroids is computed to indicate the semantic importance of the word. Experimental results on several Chinese patent datasets show that the proposed method outperforms several competitive methods.
Yuxin Xie, Xuegang Hu, Yuhong Zhang, Shi Li. "Unsupervised Keyword Extraction Method Based on Chinese Patent Clustering." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00048
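The scoring step described above can be sketched as follows; the toy two-dimensional embeddings and helper names are invented for illustration (the paper trains Skip-gram embeddings on a patent corpus and clusters patent vectors to obtain the centroids):

```python
import numpy as np

def patent_vector(words, emb):
    """One simple patent representation: the mean of the words' embeddings."""
    return np.mean([emb[w] for w in words if w in emb], axis=0)

def rank_keywords(words, emb, centroids):
    """Rank candidate words by distance to the nearest cluster centroid
    (closer to a centroid = semantically more central = better keyword)."""
    scored = []
    for w in sorted(set(words)):
        if w in emb:
            d = min(np.linalg.norm(emb[w] - c) for c in centroids)
            scored.append((d, w))
    return [w for d, w in sorted(scored)]
```

With real Skip-gram embeddings, words close to a cluster centroid sit near the semantic center of a patent topic, which is why the distance serves as an importance score.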
At present, user participation, as the core of a participatory sensing system, imposes costs on users, including their time, energy, and participation expenses. Giving reasonable feedback and encouragement for participation can therefore effectively improve users' initiative and the quality of the collected data. Combining data quantity, data distribution, and budget constraints, this paper proposes an improved reverse-auction incentive mechanism based on the structure of a participatory sensing system. First, taking maximization of the coverage rate and the number of samples as the optimization goal, a model combining a dynamic reverse-auction incentive strategy is designed under the task provider's limited budget. Second, building on the optimized sample-screening results, an improved location-based KDA incentive algorithm is proposed. The algorithm combines a greedy strategy that gradually decomposes the problem into subproblems to be optimized, ensuring that the optimization results come closer to the final goal. Finally, the algorithm is validated: experimental results show that it improves the number of samples and the coverage under limited budget constraints, and improves the quality of the best sample set.
Ziyi Qi, Mingxin Liu, Yanju Liang, Jing Chen. "Research on Incentive Algorithm of Participatory Sensing System Based on Location." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00035
Frank Madrid, Shima Imani, Ryan Mercer, Zachary Schall-Zimmerman, N. S. Senobari, Eamonn J. Keogh
Many time series analytic tasks can be reduced to discovering, and then reasoning about, conserved structures, or time series motifs. Recently, the Matrix Profile has emerged as the state of the art for finding time series motifs, allowing the community to efficiently find motifs in large datasets. The Matrix Profile reduced time series motif discovery to a process requiring a single parameter: the length of the time series motifs we expect (or wish) to find. In many cases this is a reasonable limitation, as the user may utilize out-of-band information or domain knowledge to set this parameter. However, in truly exploratory data mining, a poor choice of this parameter can result in failing to find unexpected and exploitable regularities in the data. In this work, we introduce the Pan Matrix Profile, a new data structure which contains the nearest-neighbor information for all subsequences of all lengths. This data structure allows the first truly parameter-free motif discovery algorithm in the literature. The sheer volume of information produced by our representation may be overwhelming; thus, we also introduce a novel visualization tool called the motif-heatmap, which allows users to discover and reason about repeated structures at a glance. We demonstrate our ideas on a diverse set of domains including seismology, bioinformatics, transportation and biology.
"Matrix Profile XX: Finding and Visualizing Time Series Motifs of All Lengths using the Matrix Profile." 2019 IEEE International Conference on Big Knowledge (ICBK), Nov. 2019. doi:10.1109/ICBK.2019.00031
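For intuition, a deliberately naive O(n^2) matrix profile for one fixed subsequence length m can be written as below; production Matrix Profile algorithms are far faster, and the Pan Matrix Profile described above stacks such profiles over a whole range of lengths m:

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence (constant subsequences are only centered)."""
    s = x.std()
    return (x - x.mean()) / (s if s > 1e-12 else 1.0)

def matrix_profile(ts, m):
    """Distance from each length-m subsequence to its nearest non-trivial match."""
    n = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n)])
    mp = np.full(n, np.inf)
    excl = m // 2  # exclusion zone: ignore trivially overlapping neighbors
    for i in range(n):
        for j in range(n):
            if abs(i - j) > excl:
                mp[i] = min(mp[i], np.linalg.norm(subs[i] - subs[j]))
    return mp
```

The top motif pair sits at the two lowest values of the profile; planting the same pattern twice in a noisy series drives the profile to zero at both locations.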