Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00132
Zekai Chen, Jiaze E, Xiao Zhang, Hao Sheng, Xiuzhen Cheng
Time series forecasting is a key component of many industrial and business decision processes, and recurrent neural network (RNN) based models have achieved impressive progress on various time series forecasting tasks. However, most existing methods focus on single-task forecasting, learning each task separately from limited supervised objectives, and therefore often suffer from insufficient training instances. Since the Transformer architecture and other attention-based models have demonstrated a strong capability for capturing long-term dependencies, we propose two self-attention-based sharing schemes for multi-task time series forecasting that can be trained jointly across multiple tasks. We augment a sequence of parallel Transformer encoders with an external public multi-head attention function, which is updated by the data of all tasks. Experiments on a number of real-world multi-task time series forecasting tasks show that our proposed architectures outperform not only the state-of-the-art single-task forecasting baselines but also the RNN-based multi-task forecasting method.
{"title":"Multi-Task Time Series Forecasting With Shared Attention","authors":"Zekai Chen, Jiaze E, Xiao Zhang, Hao Sheng, Xiuzhen Cheng","doi":"10.1109/ICDMW51313.2020.00132","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00132","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"127166986"}
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00009
G. D. Fatta, V. Sheng, A. Cuzzocrea
The 20th IEEE International Conference on Data Mining (ICDM) hosts many co-located workshops, whose papers are traditionally published in dedicated IEEE CS Press proceedings. The purpose of these workshops is to give exposure to current trends in data mining research that do not find space or sufficient attention in the main conference tracks, either because of the specialised application domain or because of the emerging nature of the field. The aim is to disseminate advances in these emerging fields and to attract more attention and research effort from the community. The quality of the papers and of the final program is guaranteed by an initial selection process of the workshop proposals and by a rigorous peer-review process within each workshop. This volume contains all the papers accepted for publication in the ICDM 2020 workshops and represents an interesting snapshot of data mining methods and applications in emerging and innovative areas of interest. This editorial provides an overview of the workshops included in the final program of ICDM 2020.
{"title":"The IEEE ICDM 2020 Workshops","authors":"G. D. Fatta, V. Sheng, A. Cuzzocrea","doi":"10.1109/ICDMW51313.2020.00009","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00009","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"124866163"}
The type of song or music a person is listening to usually reflects their state of mind. Many online music libraries and repositories, as well as music streaming and media service providers, play or store songs based on human emotions. This makes emotion or mood detection of songs an interesting area of study, with applications in the areas mentioned above along with many others. In this paper, we propose methods to identify the connotation of a song based on its textual lyrics only. The methods use very basic language-independent features, which show competent performance with an F1-score of more than 81%.
{"title":"Textual Lyrics Based Emotion Analysis of Bengali Songs","authors":"Devjyoti Nath, Anirban Roy, Sumitra Kumari Shaw, Amlan Ghorai, Shanta Phani","doi":"10.1109/ICDMW51313.2020.00015","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00015","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"131278898"}
Pub Date: 2020-11-01 DOI: 10.1109/icdmw51313.2020.00003
{"title":"Copyright","authors":"","doi":"10.1109/icdmw51313.2020.00003","DOIUrl":"https://doi.org/10.1109/icdmw51313.2020.00003","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"131554862"}
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00020
Sachin Kumar, Garima Gupta, Ranjitha Prasad, Arnab Chatterjee, L. Vig, Gautam M. Shroff
Advertising channels have evolved from conventional print media, billboards and radio advertising to online digital advertising (ad), where users are exposed to a sequence of ad campaigns via social networks, display ads, search, etc. While advertisers revisit the design of ad campaigns to concurrently serve the requirements emerging from new ad channels, it is also critical for them to estimate the contribution of touch-points (views, clicks, conversions) on different channels, based on the sequence of customer actions. This process of contribution measurement is often referred to as multi-touch attribution (MTA). In this work, we propose CAMTA, a novel deep recurrent neural network architecture that acts as a causal attribution mechanism for user-personalised MTA in the context of observational data. CAMTA minimizes the selection bias in channel assignment across time-steps and touch-points. Furthermore, it utilizes the users' pre-conversion actions in a principled way in order to predict per-channel attribution. To quantitatively benchmark the proposed MTA model, we employ the real-world Criteo dataset and demonstrate the superior performance of CAMTA with respect to prediction accuracy as compared to several baselines. In addition, we provide results for budget allocation and user-behaviour modeling on the predicted channel attribution.
{"title":"CAMTA: Causal Attention Model for Multi-touch Attribution","authors":"Sachin Kumar, Garima Gupta, Ranjitha Prasad, Arnab Chatterjee, L. Vig, Gautam M. Shroff","doi":"10.1109/ICDMW51313.2020.00020","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00020","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"133179011"}
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00083
D. Maurel, Sylvain Lefebvre, Jérémie Sublime
Nowadays, we can observe a proliferation of multi-view data in domains such as marketing, bank administration, survey analysis, or social networks: we are dealing with large databases that share a fair amount of data representing the same individual with different features depending on the database. In this context, one can use machine learning methods to analyze this fragmented data across several heterogeneous sources (called views). Such analysis is subject to several difficulties. First, not all individuals will be present and represented in all data sites and views. Second, this type of cross-site analysis raises several ethical questions about privacy, as no local site should have direct access to data from the other sources. To solve these problems, we present a method called the Cooperative Reconstruction System, which aims at reconstructing information missing in some views in a multi-view context using information available in the other views. Furthermore, our method takes privacy issues into account and therefore achieves said reconstruction without direct data transfer from one view to another.
{"title":"Deep Cooperative Reconstruction with Security Constraints in multi-view environments","authors":"D. Maurel, Sylvain Lefebvre, Jérémie Sublime","doi":"10.1109/ICDMW51313.2020.00083","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00083","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"133800420"}
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00055
A. Kovantsev, P. Gladilin
In this study we explore the features of time series that can be used to evaluate their predictability. We suggest using features based on Kolmogorov-Sinai entropy, correlation dimension and the Hurst exponent to test multivariate predictability. In addition, we use two new features: ‘noise measure’ and ‘random walk detection’. We then experimentally test the accuracy of multivariate time series forecasting models, including the vector autoregressive (VAR) model, the multivariate singular spectrum analysis (MSSA) model, the local approximation (LA) model and a recurrent neural network model with long short-term memory (LSTM) cells. Finally, we test different causality methods for choosing additional time series as predictors and claim that the relevance of taking additional predictors into account depends strongly on the characteristics of the target time series and can be estimated using the developed method. The results of the work can be used as a theoretical and experimental basis for the development of forecasting applications for short time series that use a combination of corporate and open-source data as additional predictors.
{"title":"Analysis of multivariate time series predictability based on their features","authors":"A. Kovantsev, P. Gladilin","doi":"10.1109/ICDMW51313.2020.00055","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00055","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"129423054"}
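One of the predictability features named in the abstract above, the Hurst exponent, can be illustrated with a classical rescaled-range (R/S) estimate. This is a minimal sketch and not the authors' implementation: the function name `hurst_rs`, the doubling window sizes, the `min_window` default and the synthetic series are all assumptions chosen for illustration.

```python
import math
import random


def hurst_rs(series, min_window=8):
    """Estimate the Hurst exponent of `series` by rescaled-range (R/S) analysis.

    Hypothetical helper for illustration; window sizes double from `min_window`
    up to half the series length.
    """
    n = len(series)
    log_sizes, log_rs = [], []
    size = min_window
    while size <= n // 2:
        rs_values = []
        # Split the series into non-overlapping windows of the current size.
        for start in range(0, n - size + 1, size):
            chunk = series[start:start + size]
            mean = sum(chunk) / size
            deviations = [x - mean for x in chunk]
            # Range of the cumulative deviations from the window mean.
            running, cumulative = 0.0, []
            for d in deviations:
                running += d
                cumulative.append(running)
            value_range = max(cumulative) - min(cumulative)
            std = math.sqrt(sum(d * d for d in deviations) / size)
            if std > 0 and value_range > 0:
                rs_values.append(value_range / std)
        if rs_values:
            log_sizes.append(math.log(size))
            log_rs.append(math.log(sum(rs_values) / len(rs_values)))
        size *= 2
    if len(log_sizes) < 2:
        raise ValueError("series too short for the chosen min_window")
    # The Hurst estimate is the least-squares slope of log(R/S) against log(size).
    mean_x = sum(log_sizes) / len(log_sizes)
    mean_y = sum(log_rs) / len(log_rs)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_sizes, log_rs))
    slope_den = sum((x - mean_x) ** 2 for x in log_sizes)
    return slope_num / slope_den


# Synthetic data (an assumption, not from the paper): white noise and its running sum.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
walk, level = [], 0.0
for step in noise:
    level += step
    walk.append(level)

h_noise = hurst_rs(noise)
h_walk = hurst_rs(walk)
print(f"white noise H ~ {h_noise:.2f}, random walk H ~ {h_walk:.2f}")
```

Applied directly to the raw series, this estimator should give a noticeably larger value for the random walk than for the white noise it was built from. That gap is the kind of signal a feature such as ‘random walk detection’ could plausibly build on, although the abstract does not say how the authors compute it.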
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00052
D. Lu, D. Ricciuto
Sensitivity analysis in terrestrial ecosystem modeling is important for understanding controlling processes, guiding model development, and targeting new observations to reduce parameter and prediction uncertainty. Complex and computationally expensive terrestrial ecosystem models (TEMs) limit the number of ensemble simulations, requiring sophisticated and efficient methods to analyze the sensitivities of multiple model responses to different types of parameter uncertainties. In this study, we propose a distance-based global sensitivity analysis (DGSA) method. DGSA first classifies model response samples into a small set of discrete classes and then calculates the distance between parameter frequency distributions in different classes to measure the parameter sensitivity. The principle is that, if the parameter distribution is the same in each class, then the model response is insensitive to the parameter, while a large difference in the distributions indicates that the parameter is influential on the response. Built on this idea, DGSA can be applied to analyze the sensitivity of a single response or a group of responses to different kinds of parameter uncertainties, including continuous, discrete and even stochastic ones. Besides the main-effect sensitivity of a single parameter, DGSA can also quantify the sensitivity arising from parameter interactions. Additionally, DGSA is computationally efficient: it can use a small number of model evaluations to obtain an accurate and statistically significant result. We applied DGSA to two TEMs, one having eight parameters and three kinds of model responses, and the other having 47 parameters and a long-period response. We demonstrated that DGSA can be used efficiently for sensitivity problems with multiple responses and high-dimensional parameters.
{"title":"Efficient Distance-based Global Sensitivity Analysis for Terrestrial Ecosystem Modeling","authors":"D. Lu, D. Ricciuto","doi":"10.1109/ICDMW51313.2020.00052","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00052","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"131333939"}
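The two DGSA steps described in the abstract above (classify the responses, then compare per-class parameter distributions) can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the authors' method: the equal-size rank-based classes, the histogram-based L1 distance, the function name `dgsa_sensitivity` and the toy model are all hypothetical choices, and only main-effect sensitivity is covered.

```python
import random


def dgsa_sensitivity(params, responses, n_classes=3, n_bins=10):
    """Return one main-effect sensitivity score per parameter column.

    Hypothetical helper: `params` is a list of parameter vectors and
    `responses` the matching scalar model outputs.
    """
    n = len(responses)
    # Step 1: classify responses into equal-size classes by rank (an assumption;
    # any clustering of the responses could be used instead).
    order = sorted(range(n), key=lambda i: responses[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = min(rank * n_classes // n, n_classes - 1)

    scores = []
    for j in range(len(params[0])):
        column = [p[j] for p in params]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant parameter
        # Step 2: normalized frequency distribution of parameter j in each class.
        histograms = []
        for c in range(n_classes):
            members = [column[i] for i in range(n) if labels[i] == c]
            counts = [0] * n_bins
            for value in members:
                counts[min(int((value - lo) / width), n_bins - 1)] += 1
            histograms.append([k / len(members) for k in counts])
        # Step 3: average L1 distance over all pairs of class distributions.
        distances = [
            sum(abs(a - b) for a, b in zip(histograms[c1], histograms[c2]))
            for c1 in range(n_classes)
            for c2 in range(c1 + 1, n_classes)
        ]
        scores.append(sum(distances) / len(distances))
    return scores


# Toy model (an assumption): the response depends on parameter 0 only.
rng = random.Random(0)
samples = [[rng.random(), rng.random()] for _ in range(600)]
outputs = [10.0 * p[0] for p in samples]
scores = dgsa_sensitivity(samples, outputs)
print("sensitivity scores:", [round(s, 3) for s in scores])
```

In this toy setup the class-wise distributions of the first parameter barely overlap, so its score should be much larger than that of the second parameter, whose distribution is roughly the same in every class. This mirrors the principle stated in the abstract.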
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00120
A. Abraham, L. Dreyfus-Schmidt
Active Learning (AL) is an active domain of research, but it is seldom used in industry despite pressing needs. This is in part due to a misalignment of objectives: while research strives to obtain the best results on selected datasets, industry wants guarantees that Active Learning will perform consistently and at least better than random labeling. The very one-off nature of Active Learning makes it crucial to understand how strategy selection can be carried out and what drives poor performance (lack of exploration, selection of samples that are too hard to classify, …). To help rebuild the trust of industrial practitioners in Active Learning, we present various actionable metrics. Through extensive experiments on reference datasets such as CIFAR100, Fashion-MNIST, and 20Newsgroups, we show that those metrics bring interpretability to AL strategies that can be leveraged by the practitioner.
{"title":"Rebuilding Trust in Active Learning with Actionable Metrics","authors":"A. Abraham, L. Dreyfus-Schmidt","doi":"10.1109/ICDMW51313.2020.00120","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00120","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"114895659"}
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00050
Daniyal Kazempour, Long Matthias Yan, Peer Kröger, T. Seidl
Data with a high number of features raises the need to detect clusters that exhibit high similarity within subspaces of features. These subspaces can be arbitrarily oriented, which gave rise to arbitrarily-oriented subspace clustering (AOSC) algorithms. Among the diversity of such algorithms, some specialize in detecting clusters that are global, spanning the entire dataset regardless of any distances, while others are tailored to detecting local clusters. Each of the two views (local and global) is obtained separately by the respective algorithms. While from an algebraic point of view neither representation can claim to be the true one, it is vital that domain scientists are presented with both views, enabling them to inspect and decide which of the representations is closest to the domain-specific reality. We propose in this work a framework capable of detecting locally dense arbitrarily oriented subspace clusters that are embedded within a global one. We also first introduce definitions of locally and globally arbitrarily oriented subspace clusters. Our experiments illustrate that this approach has no significant impact on either the cluster quality or the runtime performance, and that it enables scientists to no longer be limited exclusively to either the local or the global view.
{"title":"You see a set of wagons - I see one train: Towards a unified view of local and global arbitrarily oriented subspace clusters","authors":"Daniyal Kazempour, Long Matthias Yan, Peer Kröger, T. Seidl","doi":"10.1109/ICDMW51313.2020.00050","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00050","journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)"},"publicationDate":"2020-11-01","platform":"Semanticscholar","paperid":"116709168"}