Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00088
Michael Yeh, Ming Gu
Principal component analysis (PCA) is an important method for dimensionality reduction in data science and machine learning. However, it is expensive for large matrices when only a few components are needed. Existing fast PCA algorithms typically assume the user will supply the number of components needed, but in practice this number may not be known beforehand. Thus, it is important to have fast PCA algorithms that are driven by an error tolerance instead. We develop one such algorithm that runs quickly for matrices with rapidly decaying singular values, provide approximation error bounds that are within a constant factor of optimal, and demonstrate its utility on data from a variety of applications.
Title: An Efficient and Reliable Tolerance-Based Algorithm for Principal Component Analysis. In: 2022 IEEE International Conference on Data Mining Workshops (ICDMW).
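As a rough illustration of the stopping rule a tolerance-based PCA must implement (this is not the authors' algorithm, which avoids the full SVD used here), a numpy sketch:

```python
import numpy as np

def pca_to_tolerance(A, tol):
    # Center, then return the smallest k (and the top-k loadings) such
    # that ||A - A_k||_F <= tol * ||A||_F. A fast tolerance-based method
    # would avoid the full SVD below and stop as components are revealed.
    A = A - A.mean(axis=0)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    total = float(np.sum(s ** 2))          # ||A||_F^2 after centering
    tail = total
    for k, sv in enumerate(s, start=1):
        tail -= sv ** 2                    # squared error of the rank-k truncation
        if tail <= (tol ** 2) * total:
            return k, Vt[:k]
    return len(s), Vt
```

For matrices with rapidly decaying singular values, the loop above exits after very few components, which is exactly the regime the paper targets.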
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00072
Diletta Chiaro, E. Prezioso, Stefano Izzo, F. Giampaolo, S. Cuomo, F. Piccialli
The progress achieved in information and communication technologies, particularly in computer science, and the growing capacity of new types of computational systems (cloud/edge computing) have contributed significantly to the rise of cyber-physical systems: networks in which cooperating computational entities are tightly linked to the surrounding physical environment and its ongoing operations. These advances have made it increasingly possible to automate tasks hitherto considered an exclusively human concern, hence the gradual yet steady tendency of many companies to adopt artificial intelligence (AI) and machine learning (ML) technologies to automate human activities. This paper falls within the context of deep learning (DL) for utility pattern mining applied to Industry 4.0. Starting from images supplied by a multinational company operating in the food processing industry, we provide a DL framework for real-time pattern recognition applied to the automation of peach pitters. To this aim, we perform transfer learning (TL) for image segmentation by embedding seven pre-trained encoders into multiple segmentation architectures, and we evaluate and compare segmentation performance on our data in terms of metrics and inference speed. Furthermore, we propose an attention mechanism to improve multiscale feature learning in the FPN through attention-guided feature aggregation.
Title: Cut the peaches: image segmentation for utility pattern mining in food processing
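The encoder comparison described above amounts to running every encoder/architecture pair on the same data and recording both accuracy and inference speed. A minimal harness for that kind of comparison (the model names, batches, and metric here are placeholders, not the paper's setup):

```python
import time

def compare_models(models, batches, metric):
    # For each named model, compute the mean metric over all batches and
    # the inference throughput (batches per second) on the same data.
    results = {}
    for name, predict in models.items():
        start = time.perf_counter()
        scores = [metric(predict(x), y) for x, y in batches]
        elapsed = time.perf_counter() - start
        results[name] = (sum(scores) / len(scores), len(batches) / elapsed)
    return results
```

Reporting metric and throughput together is what lets the paper trade segmentation quality against real-time constraints on the pitting line.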
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00145
Siyan Liu, Dawei Lu, D. Ricciuto, A. Walker
Terrestrial ecosystems play a central role in the global carbon cycle and affect climate change. However, our predictive understanding of these systems is still limited due to their complexity and uncertainty about how key drivers and their legacy effects influence carbon fluxes. Here, we propose an interpretable Long Short-Term Memory (iLSTM) network for predicting net ecosystem CO2 exchange (NEE) and for interpreting the influence of environmental drivers and their memory effects on the NEE prediction. We consider five drivers and apply the method to three forest sites in the United States. Besides performing the prediction at each site, we also conduct transfer learning by using the iLSTM model trained at one site to predict at the other sites. Results show that the iLSTM model produces good NEE predictions for all three sites and, more importantly, provides reasonable interpretations of each input driver's importance as well as its temporal importance for the NEE prediction. Additionally, the iLSTM model demonstrates good across-site transferability in terms of both prediction accuracy and interpretability. The transferability can improve NEE prediction at unobserved forest sites, and the interpretability advances our predictive understanding and guides process-based model development.
Title: Improving net ecosystem CO2 flux prediction using memory-based interpretable machine learning
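The abstract does not specify how the iLSTM exposes driver and temporal importance; one common mechanism in interpretable sequence models is a softmax-normalized attention score over time steps, sketched here as an assumption rather than the paper's method:

```python
import numpy as np

def temporal_importance(scores):
    # Softmax over the time axis turns raw attention scores
    # (shape: time_steps x drivers) into per-driver weights that sum to
    # one across time, i.e. "how much each past step of each driver
    # mattered" for the current prediction.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

Because the weights are normalized per driver, they can be compared across sites, which is what makes the interpretation transferable alongside the predictions.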
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00120
Yaman Kumar Singla, Jui Shah, Changyou Chen, R. Shah
Transformer models across domains such as natural language processing and speech have become an unavoidable part of the tech stack of practitioners and researchers alike. Audio transformers that exploit representation learning to train on unlabeled speech have recently been used, with much success, for tasks from speaker verification to discourse coherence. However, little is known about what these models learn and represent in their high-dimensional latent space. In this paper, we interpret two such recent state-of-the-art models, wav2vec2.0 and Mockingjay, on linguistic and acoustic features. We probe each of their layers to understand what it is learning, and at the same time we draw a distinction between the two models. By comparing their performance across a wide variety of settings, including native, non-native, read, and spontaneous speech, we also show how well these models learn transferable features. Our results show that the models capture a wide range of characteristics: audio, fluency, suprasegmental pronunciation, and even syntactic and semantic text-based characteristics. For each category of characteristics, we identify a learning pattern for each framework and conclude which model, and which layer of that model, is better suited for feature extraction for downstream tasks.
Title: What Do Audio Transformers Hear? Probing Their Representations For Language Delivery & Structure
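Layer-wise probing of this kind typically fits a simple supervised probe on each layer's activations and compares scores across layers. A minimal least-squares probe (the probe family and scoring are illustrative; the paper's exact probes may differ):

```python
import numpy as np

def probe_r2(H, y):
    # Fit a linear probe from layer activations H (n_samples x n_dims)
    # to a target feature y by least squares, and score it with R^2.
    # Running this per layer shows where a feature is best encoded.
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(H1, y, rcond=None)
    resid = y - H1 @ w
    return 1.0 - resid.var() / y.var()
```

The layer with the highest probe score is the natural choice for feature extraction in a downstream task, which is the recommendation the paper derives per feature category.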
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00071
Naji Alhusaini, Jing Li, Philippe Fournier-Viger, Ammar Hawbani, Guilin Chen
High Utility Itemset Mining (HUIM) is the task of extracting actionable patterns while considering the utility of items, such as profits and quantities. An important issue with traditional HUIM methods is that they evaluate all items using a single threshold, which is inconsistent with reality given the differences in the nature and importance of items. Recently, algorithms have been proposed to address this problem by assigning a minimum item utility threshold to each item. However, since the minimum item utility (MIU) is expressed as a percentage of the external utility, these methods still face two problems, called “itemset missing” and “itemset explosion”. To solve them, this paper introduces a novel notion of Utility Deviation (UD), calculated based on the standard deviation. The UD and the actual utility are jointly used to calculate the MIU of items, alleviating the “itemset missing” and “itemset explosion” problems. To implement and evaluate the UD notion, a novel algorithm called HUI-MMU-UD is proposed. Experimental results demonstrate the effectiveness of the proposed notion for solving both problems, and also show that the proposed algorithm outperforms the previous HUI-MMU algorithm in many cases in terms of runtime and memory usage.
Title: Mining High Utility Itemset with Multiple Minimum Utility Thresholds Based on Utility Deviation
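One hedged reading of the Utility Deviation idea, to make the notion concrete: derive each item's MIU from the mean and standard deviation of its observed utilities instead of a fixed percentage of external utility. The `beta` parameter and the exact formula below are assumptions for illustration, not the paper's definition:

```python
import statistics

def miu_from_deviation(utilities, beta=1.0):
    # A per-item threshold anchored to the item's own utility
    # distribution: mean minus beta standard deviations, floored at 0.
    # Items with volatile utilities get a lower (more permissive)
    # threshold; stable items get one close to their mean.
    mu = statistics.mean(utilities)
    sd = statistics.pstdev(utilities)
    return max(mu - beta * sd, 0.0)
```

Tying the threshold to each item's distribution is what lets a deviation-based MIU avoid both missing itemsets (threshold too high) and an explosion of them (threshold too low).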
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00119
Yihe Wang, Mohammad Mahdi Khalili, X. Zhang
With their capacity to structure tremendous amounts of information, knowledge graphs (KGs) have aroused increasing interest in academic research and industrial applications. Recent studies have shown that demographic bias, in terms of sensitive attributes (e.g., gender and race), exists in the learned representations of KG entities. Such bias negatively affects specific populations, especially minorities and underrepresented groups, and exacerbates machine-learning-based human inequality. Adversarial learning is regarded as an effective way to alleviate bias in a representation learning model by simultaneously training a task-specific predictor and a sensitive-attribute-specific discriminator. However, due to the unique challenges posed by the topological structure and the complex relationships between knowledge entities, adversarial-learning-based debiasing has rarely been studied for representation learning in knowledge graphs. In this paper, we propose a framework to learn unbiased representations for nodes and edges in knowledge graph mining. Specifically, we integrate a simple but effective normalization technique with Graph Neural Networks (GNNs) to constrain the weight-update process. Moreover, as a work-in-progress paper, we also find that the introduced weight normalization technique can mitigate the instability of adversarial debiasing, a step towards fair and stable machine learning. We evaluate the proposed framework on a benchmark graph with multiple edge types and node types. The experimental results show that our model achieves comparable or better gender fairness than three competitive baselines on Equality of Odds. Importantly, the fairness of our model does not sacrifice performance on the knowledge graph task (i.e., multi-class edge classification).
Title: Towards Fair Representation Learning in Knowledge Graph with Stable Adversarial Debiasing
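The abstract does not spell out the normalization constraint; a simple stand-in for the general idea of constraining the weight-update process, to illustrate why normalization can tame adversarial training, is rescaling the weights after each step (an assumption, not the paper's technique):

```python
import numpy as np

def normalized_step(W, grad, lr=0.1):
    # Take a gradient step, then rescale the weight matrix to unit
    # Frobenius norm. Bounding the weights this way keeps the predictor
    # and discriminator from drifting to extreme magnitudes, one source
    # of instability in adversarial debiasing.
    W = W - lr * grad
    return W / np.linalg.norm(W)
```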
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00117
Pei Chen, Wenlin Yao, Hongming Zhang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen
Knowledge base completion (KBC) aims to predict the missing links in knowledge graphs. Previous KBC tasks and approaches mainly focus on the setting where all test entities and relations have appeared in the training set. However, there has been limited research on zero-shot KBC settings, where we need to deal with unseen entities and relations that emerge in a constantly growing knowledge base. In this work, we systematically examine different possible scenarios of zero-shot KBC and develop a comprehensive benchmark, ZeroKBC, that covers these scenarios with diverse types of knowledge sources. Our systematic analysis reveals several missing yet important zero-shot KBC settings. Experimental results show that canonical and state-of-the-art KBC systems cannot achieve satisfactory performance on this challenging benchmark. By analyzing the strengths and weaknesses of these systems on ZeroKBC, we further present several important observations and promising future directions. (This work was done during an internship at Tencent AI Lab. The data and code are available at: https://github.com/brickee/ZeroKBC)
Title: ZeroKBC: A Comprehensive Benchmark for Zero-Shot Knowledge Base Completion
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00139
Timon Sachweh, Daniel Boiar, T. Liebig
Data privacy and decentralised data collection have become more and more popular in recent years. To address privacy, communication bandwidth, and learning from spatio-temporal data, we propose two efficient models that use Differential Privacy and decentralized LSTM learning. In the first, a Long Short-Term Memory (LSTM) model is learned to extract local temporal node constraints, which are fed into a dense layer (LabelProportionToLocal). The second approach extends the first by fetching histogram data from the neighbors and joining this information with the LSTM output (LabelProportionToDense). For evaluation, two popular datasets are used: Pems-Bay and METR-LA. Additionally, we provide our own dataset, based on LuST. The evaluation shows the tradeoff between performance and data privacy.
Title: Distributed LSTM-Learning from Differentially Private Label Proportions
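The standard way to release label-proportion histograms under differential privacy, which the neighbor-fetching step above relies on, is the Laplace mechanism. A sketch of that mechanism (the paper's exact noise calibration is not given in the abstract):

```python
import random

def dp_label_histogram(counts, epsilon, seed=42):
    # A label-count histogram has sensitivity 1 (changing one record
    # moves one bin by at most one), so adding Laplace(1/epsilon) noise
    # per bin yields epsilon-differential privacy. Laplace noise is
    # sampled here as the difference of two exponentials; clipping at
    # zero is post-processing and does not weaken the guarantee.
    rng = random.Random(seed)
    noisy = [c + rng.expovariate(epsilon) - rng.expovariate(epsilon)
             for c in counts]
    return [max(n, 0.0) for n in noisy]
```

Smaller epsilon means more noise per bin, which is exactly the performance-versus-privacy tradeoff the evaluation measures.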
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00077
J. Wu, Shuo Liu, Jerry Chun‐wei Lin
High utility sequential pattern mining (HUSPM) considers timestamps, internal quantities, and external utility factors to mine high utility sequential patterns (HUSPs), and has taken an essential place in data mining. In real life, collected data may be uncertain due to environmental factors, equipment limitations, privacy issues, etc. With the rapid increase in uncertain data volume, the efficiency of traditional mining algorithms degrades severely. When the data volume is large, a conventional stand-alone algorithm generates more candidate sequences, occupies a great deal of memory, and its execution speed suffers significantly. This paper designs a high utility probabilistic sequential pattern mining algorithm based on MapReduce. The algorithm uses the MapReduce framework to overcome the bottleneck of single-machine operation when the data volume is too large, and adopts an effective pruning strategy that reduces the number of candidate itemsets generated, greatly improving performance. The performance of the proposed algorithm is verified experimentally, and its correctness and completeness are demonstrated and discussed.
Title: Large-Scale Sequential Utility Pattern Mining in Uncertain Environments
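The MapReduce split described above can be sketched in miniature: mappers scan their own partition of the database and emit (pattern, utility) pairs, and reducers aggregate utilities and prune low-utility candidates. Single items stand in for the sequential patterns of the real algorithm, and the tiny database and `min_util` value are invented for illustration:

```python
from collections import defaultdict
from itertools import chain

def mapper(partition):
    # Map phase: one worker scans its partition and emits
    # (item, utility) pairs for every item occurrence.
    for sequence in partition:
        for item, utility in sequence:
            yield item, utility

def reducer(pairs, min_util):
    # Reduce phase: sum utilities per key, then prune candidates
    # whose total utility falls below the threshold.
    totals = defaultdict(int)
    for item, utility in pairs:
        totals[item] += utility
    return {i: u for i, u in totals.items() if u >= min_util}

partitions = [
    [[("a", 5), ("b", 2)]],   # partition held by worker 1
    [[("a", 3), ("c", 9)]],   # partition held by worker 2
]
high_utility = reducer(chain.from_iterable(mapper(p) for p in partitions),
                       min_util=8)
# "b" (total utility 2) is pruned; "a" (8) and "c" (9) survive
```

Because aggregation happens per key, each reducer only sees the pairs for its own patterns, which is how the framework sidesteps the single-machine memory bottleneck.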
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00043
Elizabeth D. Hathaway, R. Hathaway
The iVAT (improved Visual Assessment of cluster Tendency) image is a useful tool for assessing possible cluster structure in an unlabeled, numerical data set. If labeled data are available, it is sometimes helpful to determine how closely the (unlabeled) data clusters agree with the partitioning induced by the labels. In this note, the DCiVAT (Diagonally Colorized iVAT) image is introduced for the case of labeled data. It incorporates all available data and label information into a single colorized iVAT image, making it possible to visually assess the degree to which data clusters align with label categories. The new approach is illustrated with several examples.
Title: Diagonally Colorized iVAT Images for Labeled Data
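DCiVAT builds on the VAT/iVAT reordering of the dissimilarity matrix; displaying the reordered matrix reveals clusters as dark diagonal blocks, and DCiVAT additionally colors the diagonal by the data labels. A minimal numpy sketch of the underlying VAT ordering step (the colorizing itself is omitted):

```python
import numpy as np

def vat_order(D):
    # VAT-style reordering of a symmetric dissimilarity matrix D: start
    # from an endpoint of the largest pairwise dissimilarity, then
    # repeatedly append the remaining point nearest to the already-
    # ordered set. Points in the same cluster end up adjacent.
    n = D.shape[0]
    order = [int(np.unravel_index(np.argmax(D), D.shape)[0])]
    rest = [i for i in range(n) if i != order[0]]
    while rest:
        sub = D[np.ix_(order, rest)]
        j = rest[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        rest.remove(j)
    return order
```

Plotting `D[np.ix_(order, order)]` as a grayscale image gives the VAT display; DCiVAT would then paint each diagonal entry with the color of its point's label so block/label agreement is visible at a glance.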