Supporting Dynamic Quantization for High-Dimensional Data Analytics
Authors: Gheorghi Guzun, Guadalupe Canahuate
Venue: Proceedings of the ExploreDB'17. International Workshop on Exploratory Search in Databases and the Web (4th : 2017 : Chicago, Ill.), 2017
DOI: 10.1145/3077331.3077336 (https://doi.org/10.1145/3077331.3077336)
Publication date: 2017-05-01
Citation count: 4
Abstract
Similarity searches are at the heart of exploratory data analysis tasks. Distance metrics are typically used to characterize the similarity between data objects represented as feature vectors. However, when the dimensionality of the data increases and the number of features is large, traditional distance metrics fail to distinguish between the closest and furthest data points. Localized distance functions have been proposed as an alternative to traditional distance metrics. These functions consider only the dimensions close to the query when computing the distance/similarity. Furthermore, in order to enable interactive exploration of high-dimensional data, indexing support for ad-hoc queries is needed. In this work we set out to investigate whether bit-sliced indices can be used for exploratory analytics, such as similarity searches and data clustering, on high-dimensional big data. We also propose a novel dynamic quantization called Query dependent Equi-Depth (QED) quantization and show its effectiveness in characterizing high-dimensional similarity. When applying QED we observe improvements in kNN classification accuracy over traditional distance functions.
ACM Reference Format: Gheorghi Guzun and Guadalupe Canahuate. 2017. Supporting Dynamic Quantization for High-Dimensional Data Analytics. In Proceedings of ExploreDB'17, Chicago, IL, USA, May 14-19, 2017, 6 pages. https://doi.org/10.1145/3077331.3077336
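The abstract's observation that traditional distance metrics stop separating the closest and furthest points in high dimensions can be illustrated with a small experiment. The sketch below is not taken from the paper: it assumes uniformly random points in the unit hypercube and plain Euclidean distance, and the function name `relative_contrast` is a hypothetical helper introduced only for this illustration. It measures how much further the furthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(num_points: int, num_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query to num_points uniform random points in [0, 1]^num_dims.

    Assumption for illustration only: data and query are drawn
    independently and uniformly; real datasets may behave differently.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))
    nearest, furthest = min(distances), max(distances)
    return (furthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast should shrink as the number of dimensions grows.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(500, dims), 3))
```

Under these assumptions the printed contrast drops sharply as the dimensionality grows, which is the effect that motivates localized distance functions and query-dependent quantization such as QED.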