Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00108
Jun Huang, Yu Yan, Xiao Zheng, Xiwen Qu, Xudong Hong
A multi-label learning (MLL) method can simultaneously process instances with multiple labels, and many well-known methods have been proposed to solve various MLL-related problems. Existing MLL methods are mainly applied under the assumption of a fixed label set, i.e., all class labels are observed in the training data. However, in many real-world applications there may be unknown labels outside this set, especially for large-scale and complex datasets. In this paper, a multi-label classification model based on deep learning is proposed to discover unknown labels for multi-label image classification. It can simultaneously predict known and unknown labels for unseen images. In addition, an attention mechanism is introduced into the model: the attention maps of unknown labels can be used to locate the corresponding objects in an image and to obtain the semantic information of these unknown labels.
Title: Discovering Unknown Labels for Multi-Label Image Classification. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00064
A. A. Neloy, M. Turgeon
Deep learning (DL) based natural language processing (NLP) has recently become one of the fastest-growing research domains and has achieved remarkable improvements in many applications. Because of the significant amount of data involved, adapting feature learning and symmetric data efficiency is a critical underlying task in such applications. However, the ability of these methods to extract features is limited by a lack of proper model formation. Moreover, their use on smaller datasets is unexplored and underdeveloped compared to more popular research areas. This work introduces a two-stage modeling approach that combines classical statistical analysis with NLP problems in a real-world dataset. We lay out a combination of a classical statistical model incorporating a stacked ensemble classifier and a DL framework of a convolutional neural network (CNN) and bidirectional recurrent neural networks (Bi-RNN) to structure a more decomposed architecture with lower computational complexity. Additionally, experimental results showing 96.69% training and 70.56% testing accuracy, together with hypothesis testing on our DL models and an ablation study, empirically validate our proposed combined modeling technique.
Title: Feature Extraction and Prediction of Combined Text and Survey Data using Two-Staged Modeling. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00074
F. Piccialli, F. Giampaolo, Vincenzo Schiano Di Cola, Federico Gatta, Diletta Chiaro, E. Prezioso, Stefano Izzo, S. Cuomo
Thanks to the widespread use of mobile devices, analyses that in the past had to be carried out in specifically designated and equipped laboratories, and that required long processing times, may now take place outdoors and in real time. In marine science, for example, the development of a mobile and compact system for the on-site detection of heavy-metal contamination in seawater would help scientists and society in at least two ways: i) reducing the time and costs associated with these experiments; ii) implementing a strategy for outdoor analysis, eventually embeddable in a lab-on-hardware system. This paper falls within the context of machine learning (ML) for utility pattern mining applied to interdisciplinary domains: starting from well-plate images, we provide a novel proof-of-concept (PoC) machine learning-based framework to assist scientists in their daily research on seawater samples. The proposed system first automatically recognises the wells in a multiwell plate and then predicts the degree of fluorescence in each of them, thus indicating the possible presence of heavy metals.
Title: A machine learning-based approach for mercury detection in marine waters. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00096
Udesh Kumarasinghe, Mohamed Nabeel, K. de Zoysa, K. Gunawardana, Charitha Elvitigala
Graph neural networks (GNNs) have achieved remarkable success in many application domains, including drug discovery, program analysis, social networks, and cyber security. However, it has been shown that they are not robust against adversarial attacks. In the recent past, many adversarial attacks against homogeneous GNNs, and defenses against them, have been proposed. However, most of these attacks and defenses are ineffective on heterogeneous graphs, because the algorithms optimize under the assumption that all edge and node types are the same and, further, they introduce semantically incorrect edges into perturbed graphs. Here, we first develop HetePR-BCD, a training-time (i.e., poisoning) adversarial attack on heterogeneous graphs that outperforms the state-of-the-art attacks proposed in the literature. Our experimental results on three benchmark heterogeneous graphs show that our attack, with a small perturbation budget of 15%, degrades performance by up to 32% (F1 score) compared to existing ones. Concerningly, existing defenses are not robust against our attack. These defenses primarily modify the GNN's neural message-passing operators, assuming that adversarial attacks tend to connect nodes with dissimilar features, but this assumption does not hold in heterogeneous graphs. We construct HeteroGuard, an effective defense against training-time attacks, including HetePR-BCD, on heterogeneous models. HeteroGuard outperforms the existing defenses by 3–8% in F1 score, depending on the benchmark dataset.
Title: HeteroGuard: Defending Heterogeneous Graph Neural Networks against Adversarial Attacks. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00103
Yu Wang, Tyler Derr
Link prediction is a fundamental problem for network-structured data and has achieved unprecedented success in many real-world applications. Despite the significant progress made towards improving its performance by characterizing underlying topological patterns or leveraging representation learning, few works have focused on the imbalanced performance among nodes of different degrees. In this paper, we propose a novel problem, degree-related bias and evaluation bias, in link prediction, with an emphasis on recommender system applications. We first empirically demonstrate the performance difference among nodes with different degrees and then theoretically prove that Recall is an unbiased evaluation metric compared with F1, NDCG and Precision. Furthermore, we show that under the unbiased evaluation metric Recall, low-degree nodes tend to have higher performance than high-degree nodes in link prediction.
Title: Degree-Related Bias in Link Prediction. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
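As a rough illustration of why the choice of metric interacts with node degree, the sketch below computes Precision@k and Recall@k for two hypothetical nodes whose candidates are ranked perfectly. It is not the authors' proof; the function name, node data, and cutoff k are assumptions made for the example. It only shows the well-known ceiling on Precision@k: a node with fewer held-out links than k can never reach a precision of 1, whereas Recall is normalized by the node's own number of held-out links.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision@k, recall@k) for one node's ranked candidate list."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    # Recall divides by the node's own number of held-out links.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical low-degree node: one held-out link, ranked first.
low_degree = precision_recall_at_k(["a", "x", "y", "z", "w"], {"a"}, k=5)

# Hypothetical high-degree node: five held-out links, all ranked first.
high_degree = precision_recall_at_k(
    ["a", "b", "c", "d", "e"], {"a", "b", "c", "d", "e"}, k=5
)

print("low-degree  (precision, recall):", low_degree)
print("high-degree (precision, recall):", high_degree)
```

Both rankings are as good as they can be, yet Precision@5 differs between the two nodes while Recall@5 does not.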
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00113
Anne Marthe Sophie Ngo Bibinbe, A. J. Mahamadou, Michael Franklin Mbouopda, E. Nguifo
Anomaly detection in data streams comes with various technical challenges due to the nature of the data. The main challenges include storage limitations, the speed of data arrival, and concept drifts. Methods for mining data streams in order to detect anomalies have been proposed in the literature. While some methods focus on tackling a specific issue, others handle diverse problems but may have high time and memory complexity. In the present work, we propose DragStream, a novel subsequence anomaly and concept drift detection algorithm for univariate data streams. DragStream extends Drag, a subsequence anomaly detection method for time series data, to streaming data. Furthermore, the new method is inspired by the well-known Matrix Profile, Drag, and MILOF, which are respectively point and subsequence anomaly detection methods for time series and data streams. We conducted intensive experiments and statistical analysis to evaluate the performance of the proposed approach against existing methods. The results show that our method is competitive in performance while being linear in time and memory complexity. Finally, we provide an open-source implementation of the new method.
Title: DragStream: An Anomaly And Concept Drift Detector In Univariate Data Streams. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
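To make the notion of a subsequence anomaly concrete, here is a minimal brute-force sketch that finds the subsequence whose nearest non-overlapping neighbour is farthest away. This is not Drag or DragStream, whose details are not given in the abstract; it is a quadratic-time baseline on a fixed series, and the function name, the toy series, and the window length m are assumptions made for the example.

```python
import math


def top_discord(series, m):
    """Return (start_index, distance) of the length-m subsequence whose
    nearest non-overlapping neighbour is farthest away (brute force)."""
    n = len(series) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < m:  # skip overlapping (trivial) matches
                continue
            a, b = series[i:i + m], series[j:j + m]
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            nearest = min(nearest, d)
        # Keep the subsequence that is most isolated from all the others.
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


# Toy periodic series with a single spike at position 7.
series = [0, 1, 0, 1, 0, 1, 0, 9, 0, 1, 0, 1, 0, 1, 0, 1]
idx, dist = top_discord(series, m=4)
print("most anomalous window starts at", idx, "with distance", dist)
```

The reported window covers the spike, because every window that contains it has no close match elsewhere in the series. A streaming method would need to avoid this all-pairs comparison, which is the cost the abstract's linear time and memory claim refers to.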
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00013
Reda Khoufache, M. Dilmi, Hanene Azzag, Etienne Gofinnet, M. Lebbah
Artificial Intelligence (AI) in supermarkets is moving fast with the recent advances in deep learning. One important project in the retail sector is the development of AI solutions for smart stores, mainly to improve product recognition. In this paper, we present a new framework that addresses multi-view image classification using multiple clustering. The proposed framework combines a pre-trained Vision Transformer with Bayesian Non-Parametric multiple clustering. We propose an MCMC-based inference approach to learn the column-partition and the row-partitions. This method infers multiple clustering solutions and automatically finds the number of clusters. Our method provides interesting results on a multi-view image dataset and emphasizes, on the one hand, the power of pre-trained Vision Transformers combined with the multiple clustering algorithm and, on the other hand, the usefulness of Bayesian Non-Parametric modeling, which automatically performs model selection.
Title: Emerging properties from Bayesian Non-Parametric for multiple clustering: Application for multi-view image dataset. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00075
Yanlin Qi, Fuyin Lai, Guoting Chen, Wensheng Gan
This paper proposes an effective algorithm for discovering valuable patterns by applying the fuzzy method to the RFM model. RFM analysis is a common method in customer relationship management, through which valuable customer groups can be identified. By combining RFM analysis with frequent pattern mining, valuable RFM-patterns can be found from the RFM-pattern-tree, as in the RFMP-growth algorithm. Aiming to mine patterns that have quantitative relationships among items, we introduce the fuzzy method into the RFM model and present a fuzzy-Rfu-tree algorithm in which a new pruning strategy is proposed to prune candidate patterns. Experiments show the effectiveness of the new algorithm. The new algorithm guarantees a high degree of overlap with the RFM-patterns generated by RFMP-growth, with more valuable information (an additional fuzzy level) in the mined patterns.
Title: Mining Valuable Fuzzy Patterns via the RFM Model. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
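For readers unfamiliar with RFM analysis, the sketch below computes the three classical customer-level quantities (recency in days, frequency, and monetary value) from a small transaction list. It illustrates plain RFM only, not the pattern-level definitions or the fuzzy extension of the paper, which the abstract does not spell out. The function name, customer names, and reference date are hypothetical.

```python
from datetime import date


def rfm_scores(transactions, today):
    """Map each customer to (recency_days, frequency, monetary)."""
    summary = {}
    for customer, day, amount in transactions:
        last, freq, total = summary.get(customer, (day, 0, 0.0))
        # Track the latest purchase date, the purchase count, and total spend.
        summary[customer] = (max(last, day), freq + 1, total + amount)
    return {
        customer: ((today - last).days, freq, total)
        for customer, (last, freq, total) in summary.items()
    }


# Hypothetical transactions: (customer, purchase date, amount).
transactions = [
    ("alice", date(2022, 10, 1), 30.0),
    ("alice", date(2022, 10, 20), 45.0),
    ("bob", date(2022, 8, 15), 120.0),
]
scores = rfm_scores(transactions, today=date(2022, 11, 1))
for customer, (recency, frequency, monetary) in scores.items():
    print(customer, recency, frequency, monetary)
```

A low recency together with high frequency and monetary values marks a customer as valuable; pattern-mining variants apply analogous measures to itemsets rather than to customers.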
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00089
Peng Zhou, Yunyun Zhang, Yuan-Ting Yan, Shu Zhao
Feature selection aims to select an optimal minimal feature subset from the original dataset and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of big data. Most feature selection methods implicitly assume that the feature type (categorical, numerical, or mixed) is known before learning, and then design corresponding measurements to calculate the correlation between features. However, in practical applications, features may be generated dynamically and arrive one by one over time; we call these streaming features. Most existing streaming feature selection methods assume either that all dynamically generated features are of the same type or that the type of each newly arriving feature can be known on the fly, but this is unreasonable and unrealistic. Therefore, this paper first studies the practical issue of Unknown Type Streaming Feature Selection and proposes a new method to handle it, named UT-SFS. Extensive experimental results indicate the effectiveness of our new method. UT-SFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.
Title: Unknown Type Streaming Feature Selection via Maximal Information Coefficient. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00083
Nitin Ramrakhiyani, Sangameshwar Patil, Manideep Jella, Alok Kumar, G. Palshikar
Cyber-physical systems are an important part of many industries, such as the chemical process industry, the manufacturing industry, automobiles, and even sophisticated weaponry. Given the economic importance and influence of these systems, they have increasingly faced cybersecurity attacks. In this paper, we provide a dataset of real-life security incident reports on cyber-physical systems, annotated with entities and events that are important for analysing such security incidents. We analyze and identify the limitations of the 'Domain Objects' in the Structured Threat Information Expression (STIX) standard, as well as of recent research literature, for entity type classification schemes in the cybersecurity domain. We propose an updated classification scheme for entity types in the cybersecurity domain. The enhanced coverage provided by the entity scheme is important for automated information extraction and natural language understanding of textual cybersecurity incident reports. We use deep-learning based sequence labelling techniques and cybersecurity domain-specific word embeddings to set up a benchmark for entity and event extraction for cyber-physical security incident report analysis. The annotated dataset of real-life industrial security incidents will be made available for research purposes.
Title: Extracting Entities and Events from Cyber-Physical Security Incident Reports. Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).