
Latest publications: 2022 IEEE International Conference on Data Mining Workshops (ICDMW)

An application of Customer Embedding for Clustering
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00019
Ahmet Tugrul Bayrak
Effective and powerful strategic planning in a competitive business environment brings businesses to the fore. For business growth, it is important to put the customer at the center by acting more intelligently when planning marketing and sales activities. Clustering models from machine learning can yield effective results in finding customer behavior patterns. In this study, traditional customer clustering methods are enriched by using customer representations as features. To achieve this, a natural language processing method, word embedding, is applied to customers: using the mechanism of word-embedding methods, a customer space is created in which customers are represented by the products they have bought. Appending customer embeddings is observed to have a positive effect on customer clustering, and the results seem promising for further studies.
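As a toy illustration of the idea described above (product names, vectors, and cluster seeds are all invented; the paper trains real word embeddings on purchase data), a customer can be represented as the mean of the embeddings of the products they bought, then grouped in that space:

```python
# Hypothetical sketch: a customer is the mean of the embeddings of the
# products they bought. Product vectors here are hand-made 2-d toys; in the
# paper they would come from a word-embedding model trained on purchases.
product_vec = {
    "milk":  (0.9, 0.1), "bread": (0.8, 0.2),   # grocery-like items
    "phone": (0.1, 0.9), "laptop": (0.2, 0.8),  # electronics-like items
}

def customer_embedding(basket):
    """Average the embeddings of the products a customer bought."""
    vecs = [product_vec[p] for p in basket]
    return tuple(sum(coord) / len(vecs) for coord in zip(*vecs))

customers = {
    "alice": ["milk", "bread"],
    "bob":   ["phone", "laptop"],
    "carol": ["milk", "bread", "bread"],
}
emb = {c: customer_embedding(b) for c, b in customers.items()}

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Nearest-seed "clustering" with two hand-picked seeds, just to show that the
# embedding space separates grocery buyers from electronics buyers.
seeds = {"grocery": (0.85, 0.15), "electronics": (0.15, 0.85)}
cluster = {c: min(seeds, key=lambda s: dist(emb[c], seeds[s])) for c in emb}
```

In practice the seeds would be learned by a clustering algorithm such as k-means rather than hand-picked.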
Citations: 0
What Do Audio Transformers Hear? Probing Their Representations For Language Delivery & Structure
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00120
Yaman Kumar Singla, Jui Shah, Changyou Chen, R. Shah
Transformer models across multiple domains such as natural language processing and speech form an unavoidable part of the tech stack of practitioners and researchers alike. Audio transformers that exploit representational learning to train on unlabeled speech have recently been used with much success for tasks from speaker verification to discourse coherence. However, little is known about what these models learn and represent in their high-dimensional latent space. In this paper, we interpret two such recent state-of-the-art models, wav2vec2.0 and Mockingjay, in terms of linguistic and acoustic features. We probe each of their layers to understand what it is learning and, at the same time, draw a distinction between the two models. By comparing their performance across a wide variety of settings including native, non-native, read, and spontaneous speech, we also show how well these models learn transferable features. Our results show that the models significantly capture a wide range of characteristics such as audio, fluency, suprasegmental pronunciation, and even syntactic and semantic text-based characteristics. For each category of characteristics, we identify a learning pattern for each framework and conclude which model, and which layer of that model, is better suited for feature extraction for a given category of feature in downstream tasks.
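The layer-wise probing procedure can be sketched as follows; the synthetic 2-d "layer representations", the binary labels, and the nearest-centroid probe are stand-ins for real wav2vec2.0/Mockingjay hidden states and the paper's actual probes:

```python
# Toy layer probing: for each (synthetic) layer, fit a simple nearest-centroid
# probe for a binary property (e.g. native vs. non-native speech) and record
# probe accuracy per layer. Higher probe accuracy = property better encoded.
def centroid_probe_accuracy(features, labels):
    """Fit per-class centroids, then score nearest-centroid accuracy."""
    cents = {}
    for y in set(labels):
        pts = [f for f, l in zip(features, labels) if l == y]
        cents[y] = tuple(sum(c) / len(pts) for c in zip(*pts))

    def predict(f):
        return min(cents, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(f, cents[y])))

    return sum(predict(f) == y for f, y in zip(features, labels)) / len(labels)

labels = [0, 0, 1, 1]
# Layer 1 separates the property cleanly; layer 2 barely encodes it.
layers = {
    1: [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)],
    2: [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4), (0.5, 0.6)],
}
acc = {k: centroid_probe_accuracy(v, labels) for k, v in layers.items()}
best_layer = max(acc, key=acc.get)  # layer to pick for this downstream feature
```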
Citations: 4
Towards Fair Representation Learning in Knowledge Graph with Stable Adversarial Debiasing
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00119
Yihe Wang, Mohammad Mahdi Khalili, X. Zhang
With their vast graph-structured information, Knowledge Graphs (KGs) have aroused increasing interest in academic research and industrial applications. Recent studies have shown that demographic bias, in terms of sensitive attributes (e.g., gender and race), exists in the learned representations of KG entities. Such bias negatively affects specific populations, especially minorities and underrepresented groups, and exacerbates machine-learning-based human inequality. Adversarial learning is regarded as an effective way to alleviate bias in a representation learning model by simultaneously training a task-specific predictor and a sensitive-attribute-specific discriminator. However, due to the unique challenges posed by topological structure and the comprehensive relationships between knowledge entities, adversarial-learning-based debiasing has rarely been studied for representation learning in knowledge graphs. In this paper, we propose a framework to learn unbiased representations for nodes and edges in knowledge graph mining. Specifically, we integrate a simple but effective normalization technique with Graph Neural Networks (GNNs) to constrain the weight-updating process. Moreover, as a work-in-progress paper, we also find that the introduced weight-normalization technique can mitigate the pitfalls of instability in adversarial debiasing, towards fair and stable machine learning. We evaluate the proposed framework on a benchmark graph with multiple edge types and node types. The experimental results show that our model achieves comparable or better gender fairness than three competitive baselines on Equality of Odds. Importantly, the superiority of our fair model does not sacrifice performance on the knowledge graph task (i.e., multi-class edge classification).
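One simple weight normalization of the kind alluded to above is rescaling each weight row to unit L2 norm before every update, which bounds how far the predictor and discriminator weights can drift during adversarial training. This sketch shows only that constraint; the GNN, predictor, and discriminator are omitted, and the paper's exact normalization may differ:

```python
# Illustrative weight normalization for stabilizing adversarial training:
# rescale each row of a weight matrix to unit L2 norm. Applied after every
# gradient step, this keeps weight magnitudes bounded.
def row_normalize(w, eps=1e-12):
    out = []
    for row in w:
        norm = sum(x * x for x in row) ** 0.5
        out.append([x / (norm + eps) for x in row])  # eps guards zero rows
    return out

w = [[3.0, 4.0], [0.0, 2.0]]   # toy 2x2 weight matrix
w_hat = row_normalize(w)       # rows now lie on the unit circle
```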
Citations: 1
Cut the peaches: image segmentation for utility pattern mining in food processing
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00072
Diletta Chiaro, E. Prezioso, Stefano Izzo, F. Giampaolo, S. Cuomo, F. Piccialli
The progress achieved in information and communication technologies, particularly in computer science, and the growing capacity of new types of computational systems (cloud/edge computing) have significantly contributed to cyber-physical systems: networks where cooperating computational entities are intensively linked to the surrounding physical environment and its ongoing operations. All this has increased the possibility of automatically undertaking tasks hitherto considered an exclusively human concern: hence the gradual yet progressive tendency of many companies to adopt artificial intelligence (AI) and machine learning (ML) technologies to automate human activities. This paper falls within the context of deep learning (DL) for utility pattern mining applied to Industry 4.0. Starting from images supplied by a multinational company operating in the food processing industry, we provide a DL framework for real-time pattern recognition applied to the automation of peach pitters. To this aim, we perform transfer learning (TL) for image segmentation by embedding seven pre-trained encoders into multiple segmentation architectures, and we evaluate and compare segmentation performance in terms of metrics and inference speed on our data. Furthermore, we propose an attention mechanism to improve multiscale feature learning in the FPN through attention-guided feature aggregation.
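Attention-guided feature aggregation, as mentioned above, can be sketched as a softmax-weighted sum over pyramid levels. Feature maps are collapsed to one scalar per level for clarity, and the names are illustrative rather than the paper's actual FPN architecture:

```python
import math

# Toy attention-guided aggregation across pyramid levels: each level gets a
# (learned) score, softmax turns the scores into attention weights, and the
# fused feature is the weighted sum of the level features.
def attention_fuse(features, scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over levels
    fused = sum(w * f for w, f in zip(weights, features))
    return weights, fused

# With equal scores, every level contributes equally.
weights, fused = attention_fuse([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```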
Citations: 0
ZeroKBC: A Comprehensive Benchmark for Zero-Shot Knowledge Base Completion
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00117
Pei Chen, Wenlin Yao, Hongming Zhang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen
Knowledge base completion (KBC) aims to predict the missing links in knowledge graphs. Previous KBC tasks and approaches mainly focus on the setting where all test entities and relations have appeared in the training set. However, there has been limited research on zero-shot KBC settings, where we need to deal with unseen entities and relations that emerge in a constantly growing knowledge base. In this work, we systematically examine different possible scenarios of zero-shot KBC and develop a comprehensive benchmark, ZeroKBC, that covers these scenarios with diverse types of knowledge sources. Our systematic analysis reveals several missing yet important zero-shot KBC settings. Experimental results show that canonical and state-of-the-art KBC systems cannot achieve satisfactory performance on this challenging benchmark. By analyzing the strengths and weaknesses of these systems in solving ZeroKBC, we further present several important observations and promising future directions. (Work was done during an internship at the Tencent AI lab. The data and code are available at: https://github.com/brickee/ZeroKBC)
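The zero-shot scenarios can be illustrated by categorizing test triples according to whether their entities or relations were seen during training. The toy triples below are invented, and the categorization is a guess at the benchmark's spirit rather than its actual construction:

```python
# Hypothetical zero-shot KBC categorization: triples are (head, relation, tail).
train = [("a", "r1", "b"), ("b", "r1", "c"), ("a", "r2", "c")]
candidates = [("a", "r1", "c"), ("d", "r1", "e"), ("a", "r3", "b")]

seen_entities = {e for h, _, t in train for e in (h, t)}
seen_relations = {r for _, r, _ in train}

def zero_shot_kind(triple):
    """Label a test triple by what it requires beyond the training set."""
    h, r, t = triple
    unseen_ent = h not in seen_entities or t not in seen_entities
    unseen_rel = r not in seen_relations
    if unseen_ent and unseen_rel:
        return "both"
    if unseen_ent:
        return "entity"      # unseen-entity zero-shot setting
    if unseen_rel:
        return "relation"    # unseen-relation zero-shot setting
    return "standard"        # classic KBC setting

kinds = [zero_shot_kind(t) for t in candidates]
```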
Citations: 0
Identify malfunctions and their possible causes using rules, application to process mining
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00023
Benoit Vuillemin, F. Bertrand
In the field of process mining, malfunction analysis is a major research domain. The goal is to find failures or relatively large processing delays and their possible causes. This paper presents an innovative research paradigm for process mining: prediction rule mining. Through a three-step method and two new algorithms, all observed cases of a process are decomposed into rules, whose information is analyzed and searched for possible causes. This method provides information about the data, from its internal structure to the possible causes of failures, without requiring a priori knowledge about them.
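The abstract does not specify the rule-mining algorithms, so the following is only a loosely related illustration: a delay-based rule template that flags activities whose processing delay far exceeds that activity's typical (median) delay as candidate malfunctions:

```python
# Illustrative only: flag unusually long processing delays per activity.
# Rule template: "delay(activity) > factor * median(activity)" -> malfunction.
def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

events = [  # (case_id, activity, processing delay in minutes) -- invented data
    ("c1", "check", 90),   # unusually long -> candidate malfunction
    ("c2", "check", 6),
    ("c3", "check", 7),
    ("c4", "check", 5),
    ("c1", "receive", 4),
    ("c2", "receive", 5),
]
delays = {}
for _, act, d in events:
    delays.setdefault(act, []).append(d)

def long_delays(activity, factor=5.0):
    m = median(delays[activity])
    return [d for d in delays[activity] if d > factor * m]

flagged = {a: long_delays(a) for a in delays if long_delays(a)}
```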
Citations: 0
Mining High Utility Itemset with Multiple Minimum Utility Thresholds Based on Utility Deviation
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00071
Naji Alhusaini, Jing Li, Philippe Fournier-Viger, Ammar Hawbani, Guilin Chen
High Utility Itemset Mining (HUIM) is the task of extracting actionable patterns that consider the utility of items, such as profits and quantities. An important issue with traditional HUIM methods is that they evaluate all items using a single threshold, which is inconsistent with reality given the differences in the nature and importance of items. Recently, algorithms were proposed to address this problem by assigning a minimum item utility threshold to each item. However, since the minimum item utility (MIU) is expressed as a percentage of the external utility, these methods still face two problems, called “itemset missing” and “itemset explosion”. To solve these problems, this paper introduces a novel notion of Utility Deviation (UD), calculated based on the standard deviation. The UD and the actual utility are jointly used to calculate the MIU of items, alleviating the “itemset missing” and “itemset explosion” problems. To implement and evaluate the UD notion, a novel algorithm called HUI-MMU-UD is proposed. Experimental results demonstrate the effectiveness of the proposed notion in solving the “itemset missing” and “itemset explosion” problems. Results also show that the proposed algorithm outperforms the previous HUI-MMU algorithm in many cases in terms of runtime and memory usage.
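The abstract says MIU is derived jointly from an item's actual utility and the standard deviation of utilities, but does not give the formula; the sketch below assumes one plausible form, miu(i) = max(u(i) - sd, floor), purely for illustration:

```python
# Hypothetical per-item MIU from utility deviation. The exact formula in the
# paper is not given in the abstract; this is one illustrative choice.
utilities = {"A": 100.0, "B": 40.0, "C": 10.0}  # invented item utilities

vals = list(utilities.values())
mean = sum(vals) / len(vals)
sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5  # population sd

def miu(item, floor=1.0):
    """Per-item minimum utility: actual utility minus the deviation, floored."""
    return max(utilities[item] - sd, floor)

thresholds = {i: miu(i) for i in utilities}
```

High-utility items get a high personal threshold while low-utility items keep a low one, which is the behavior multiple-minimum-utility methods aim for.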
Citations: 0
An Efficient and Reliable Tolerance-Based Algorithm for Principal Component Analysis
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00088
Michael Yeh, Ming Gu
Principal component analysis (PCA) is an important method for dimensionality reduction in data science and machine learning. However, it is expensive for large matrices when only a few components are needed. Existing fast PCA algorithms typically assume the user will supply the number of components needed, but in practice this number may not be known beforehand. Thus, it is important to have fast PCA algorithms that depend on a tolerance instead. We develop one such algorithm that runs quickly for matrices with rapidly decaying singular values, provide approximation error bounds that are within a constant factor of optimal, and demonstrate its utility on data from a variety of applications.
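The paper's fast algorithm is not reproduced here; this sketch (assuming NumPy) shows only the tolerance-based rank selection it targets: keep every component whose singular value exceeds `tol` times the largest one, instead of asking the user for a component count up front:

```python
import numpy as np

def pca_by_tolerance(X, tol=1e-2):
    """Toy tolerance-based PCA via a full SVD (not the paper's fast method)."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))              # rank implied by the tolerance
    return Xc @ Vt[:k].T, Vt[:k], s[:k]          # scores, components, spectrum

rng = np.random.default_rng(0)
# Rank-2 data embedded in 5 dimensions, plus tiny noise: the tolerance
# criterion should recover exactly 2 components without being told so.
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) \
    + 1e-8 * rng.normal(size=(100, 5))
scores, components, s = pca_by_tolerance(X, tol=1e-4)
```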
Citations: 1
Identifying Hydrometeorological Factors Influencing Reservoir Releases Using Machine Learning Methods
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00143
Ming Fan, Lujun Zhang, Siyan Liu, Tiantian Yang, Dawei Lu
Simulation of reservoir releases plays a critical role in socio-economic functioning and our nation's security. However, it is challenging to predict reservoir releases accurately because of the many influential factors from natural environments and engineering controls, such as reservoir inflow and storage. Moreover, climate change and hydrological intensification, which cause extreme precipitation and temperatures, make accurate prediction of reservoir releases even more challenging. Machine learning (ML) methods have shown some successful applications in simulating reservoir releases. However, previous studies mainly used inflow and storage data as inputs and only considered their short-term influences (e.g., the previous one or two days). In this work, we use long short-term memory (LSTM) networks for reservoir release prediction based on four input variables, inflow, storage, precipitation, and temperature, and consider their long-term influences. We apply the LSTM model to 30 reservoirs in the Upper Colorado River Basin, United States. We analyze the prediction performance using six statistical metrics. More importantly, we investigate the influence of the input hydrometeorological factors, as well as their temporal effects, on reservoir release decisions. Results indicate that inflow and storage are the most influential factors, but including precipitation and temperature can further improve the prediction of releases, especially at low flows. Additionally, inflow and storage have a relatively long-term effect on releases. These findings can help optimize water resources management in reservoirs.
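One common way long-term influence enters an LSTM's inputs is through a lookback window: each training sample stacks the last few days of (inflow, storage, precipitation, temperature) to predict the next day's release. The window construction below is a generic sketch with invented data, not the paper's exact pipeline:

```python
# Generic lookback-window construction for sequence models (invented data).
def make_windows(series, lookback):
    """series: list of (inflow, storage, precip, temp, release) per day.
    Returns X (lookback days of the 4 drivers) and y (next day's release)."""
    X, y = [], []
    for t in range(lookback, len(series)):
        X.append([day[:4] for day in series[t - lookback:t]])  # drivers only
        y.append(series[t][4])                                 # target release
    return X, y

# Ten toy days: inflow ramps up, storage follows, release ramps with it.
days = [(i, 100 + i, 0.0, 20.0, 10 + i) for i in range(10)]
X, y = make_windows(days, lookback=3)
```

A longer `lookback` lets the model see further back in time, which is how the long-term effect of inflow and storage noted above would be captured.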
{"title":"Identifying Hydrometeorological Factors Influencing Reservoir Releases Using Machine Learning Methods","authors":"Ming Fan, Lujun Zhang, Siyan Liu, Tiantian Yang, Dawei Lu","doi":"10.1109/ICDMW58026.2022.00143","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00143","url":null,"abstract":"Simulation of reservoir releases plays a critical role in social-economic functioning and our nation's security. How-ever, it is challenging to predict the reservoir release accurately because of many influential factors from natural environments and engineering controls such as the reservoir inflow and storage. Moreover, climate change and hydrological intensification causing the extreme precipitation and temperature make the accurate prediction of reservoir releases even more challenging. Machine learning (ML) methods have shown some successful applications in simulating reservoir releases. However, previous studies mainly used inflow and storage data as inputs and only considered their short-term influences (e.g, previous one or two days). In this work, we use long short-term memory (LSTM) networks for reservoir release prediction based on four input variables including inflow, storage, precipitation, and temperature and consider their long-term influences. We apply the LSTM model to 30 reservoirs in Upper Colorado River Basin, United States. We analyze the prediction performance using six statistical metrics. More importantly, we investigate the influence of the input hydrometeorological factors, as well as their temporal effects on reservoir release decisions. Results indicate that inflow and storage are the most influential factors but the inclusion of precipitation and temperature can further improve the prediction of release especially in low flows. Additionally, the inflow and storage have a relatively long-term effect on the release. 
These findings can help optimize the water resources management in the reservoirs.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127820176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
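The mechanism this abstract relies on, an LSTM carrying long-term influence of the four drivers (inflow, storage, precipitation, temperature) across a lookback window, can be sketched as a single-cell forward pass. The weights below are random and untrained, so this shows only how the cell state accumulates long-range information; it is not the paper's fitted model, and all names and sizes are illustrative:

```python
import numpy as np

def lstm_forward(x_seq, Wx, Wh, b):
    """Single-layer LSTM forward pass; gate weights stacked as [i, f, g, o]."""
    n_h = Wh.shape[1]
    h, c = np.zeros(n_h), np.zeros(n_h)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in x_seq:
        z = Wx @ x + Wh @ h + b              # all four gate pre-activations at once
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell state: long-term memory
        h = o * np.tanh(c)                   # hidden state: per-step output
    return h

rng = np.random.default_rng(1)
n_in, n_h, T = 4, 8, 30                      # 4 drivers, 30-day lookback window
Wx = rng.normal(scale=0.1, size=(4 * n_h, n_in))
Wh = rng.normal(scale=0.1, size=(4 * n_h, n_h))
b = np.zeros(4 * n_h)
window = rng.normal(size=(T, n_in))          # stand-in daily hydrometeorological inputs
h_last = lstm_forward(window, Wx, Wh, b)
release = rng.normal(scale=0.1, size=n_h) @ h_last   # linear readout of the release
```

Because the forget gate `f` multiplies the previous cell state at every step, the final hidden state can retain information from early in the 30-day window, which is exactly the long-term influence the study exploits.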
Deep-SHEEP: Sense of Humor Extraction from Embeddings in the Personalized Context
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00125
Julita Bielaniewicz, Kamil Kanclerz, P. Milkowski, Marcin Gruza, Konrad Karanowski, Przemyslaw Kazienko, Jan Kocoń
As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Because of its subjective nature, recognizing humor is a very challenging task in NLP. Here, we present a personalized approach to predicting humor in text: it takes into account both the text and the context of the content receiver. For that purpose, we propose four Deep-SHEEP learning models that take advantage of user preference information in different ways. The experiments were conducted on four datasets: Cockamamie, HUMOR, Jester, and Humicroedit. The results show that applying an innovative personalized approach and a user-centric perspective significantly improves performance compared to generalized methods. Moreover, even for random text embeddings, our personalized methods outperform the generalized ones on the subjective humor modeling task. We also argue that user-related data reflecting an individual sense of humor is of similar importance to the evaluated text itself. Different types of humor were investigated as well.
{"title":"Deep-SHEEP: Sense of Humor Extraction from Embeddings in the Personalized Context","authors":"Julita Bielaniewicz, Kamil Kanclerz, P. Milkowski, Marcin Gruza, Konrad Karanowski, Przemyslaw Kazienko, Jan Kocoń","doi":"10.1109/ICDMW58026.2022.00125","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00125","url":null,"abstract":"As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Due to its subjective nature, recognizing humor in NLP is a very challenging task. Here, we present a new approach to the task of predicting humor in the text by applying the idea of a personalized approach. It takes into account both the text and the context of the content receiver. For that purpose, we proposed four Deep-SHEEP learning models that take advantage of user preference information differently. The experiments were conducted on four datasets: Cockamamie, HUMOR, Jester, and Humicroedit. The results have shown that the application of an innovative personalized approach and user-centric perspective significantly improves performance compared to generalized methods. Moreover, even for random text embeddings, our personalized methods outperform the generalized ones in the subjective humor modeling task. We also argue that the user-related data reflecting an individual sense of humor has similar importance as the evaluated text itself. 
Different types of humor were investigated as well.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133326524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
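The core of the personalized setup described above, conditioning the prediction on a user representation in addition to the text, can be sketched as follows. The embeddings and weights here are random stand-ins (this is not a reproduction of the Deep-SHEEP models or their datasets); the point is only that the same text yields a different humor score for different users:

```python
import numpy as np

rng = np.random.default_rng(2)
d_text, d_user, n_users = 16, 8, 3

user_emb = rng.normal(size=(n_users, d_user))  # one (normally learned) vector per user
w = rng.normal(size=d_text + d_user)           # weights of a logistic scoring head
b = 0.0

def humor_score(text_vec, user_id):
    """P(funny) for one (text, user) pair: score the concatenated features."""
    feats = np.concatenate([text_vec, user_emb[user_id]])
    return 1.0 / (1.0 + np.exp(-(w @ feats + b)))

text_vec = rng.normal(size=d_text)             # stand-in sentence embedding
scores = [humor_score(text_vec, u) for u in range(n_users)]
```

In a trained system the user vectors would be fit jointly with the head, so annotators with similar humor preferences end up close together, which is what lets a personalized model beat a single generalized one on subjective labels.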
Journal
2022 IEEE International Conference on Data Mining Workshops (ICDMW)