2022 IEEE International Conference on Data Mining Workshops (ICDMW): Latest Papers

A Comparison of Ambulance Redeployment Systems on Real-World Data
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00010
Niklas Strauß, Max Berrendorf, Tom Haider, M. Schubert
Modern Emergency Medical Services (EMS) benefit from real-time sensor information in various ways, as it provides up-to-date location information and helps assess current local emergency risks. A critical part of EMS is dynamic ambulance redeployment, i.e., the task of assigning idle ambulances to base stations throughout a community. Although considerable effort has gone into methods to optimize emergency response systems, comparing the proposed methods is generally difficult because reported results are mostly based on artificial and proprietary test beds. In this paper, we present a benchmark simulation environment for dynamic ambulance redeployment based on real emergency data from the city of San Francisco. Our proposed simulation environment is highly scalable and is compatible with modern reinforcement learning frameworks. We provide a comparative study of several state-of-the-art methods on various metrics. Results indicate that even simple baseline algorithms can perform quite well in close-to-realistic settings. The code of our simulator is openly available at https://github.com/niklasdbs/ambusim.
Citations: 1
A study of automatic speech recognition in Portuguese by the Brazilian General Attorney of the Union
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00038
Rodrigo Fay Verqara, Paulo Henrique dos Santos, Guilherme Fay Verqara, Fábio L. L. Mendonça, C. E. L. Veiga, B. Praciano, Daniel Alves da Silva, Rafael Timóteo de Sousa Júnior
This article presents a study of an automatic speech recognition system in Portuguese applied to videos by the General Attorney of the Union of Brazil. Because the videos are confidential, proprietary software from large companies cannot be used for security reasons. Thus, constructing an artificial intelligence model capable of performing automatic speech recognition in Portuguese in the judicial context, and making this model available for large-scale inference, is critical to maintaining data security. For this purpose, a Brazilian Portuguese dataset was assembled by combining three existing datasets. The system used the TDNN Jasper and QuartzNet architectures for network training and obtained promising preliminary results, with a word error rate (WER) of 56% without using a language model.
Citations: 1
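The word error rate (WER) reported in the abstract above is conventionally defined as the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenization is plain whitespace splitting, which is
    an assumption and may differ from a real evaluation pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.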
Identifying Hydrometeorological Factors Influencing Reservoir Releases Using Machine Learning Methods
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00143
Ming Fan, Lujun Zhang, Siyan Liu, Tiantian Yang, Dawei Lu
Simulation of reservoir releases plays a critical role in social-economic functioning and our nation's security. However, it is challenging to predict reservoir releases accurately because of the many influential factors from natural environments and engineering controls, such as reservoir inflow and storage. Moreover, climate change and hydrological intensification, which cause extreme precipitation and temperature, make the accurate prediction of reservoir releases even more challenging. Machine learning (ML) methods have shown some successful applications in simulating reservoir releases. However, previous studies mainly used inflow and storage data as inputs and only considered their short-term influences (e.g., the previous one or two days). In this work, we use long short-term memory (LSTM) networks for reservoir release prediction based on four input variables (inflow, storage, precipitation, and temperature) and consider their long-term influences. We apply the LSTM model to 30 reservoirs in the Upper Colorado River Basin, United States. We analyze the prediction performance using six statistical metrics. More importantly, we investigate the influence of the input hydrometeorological factors, as well as their temporal effects on reservoir release decisions. Results indicate that inflow and storage are the most influential factors, but the inclusion of precipitation and temperature can further improve the prediction of releases, especially in low flows. Additionally, inflow and storage have a relatively long-term effect on the release. These findings can help optimize water resources management in the reservoirs.
Citations: 4
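One common way to expose a sequence model to the long-term influences mentioned in the abstract above is to feed it a window of past daily values for each prediction. The sketch below only builds such windows from four daily input variables and is not the authors' pipeline; the function name `make_windows`, the `lookback` parameter, and the choice to predict the release on day t from the preceding `lookback` days are illustrative assumptions, since the abstract does not specify the exact windowing.

```python
from typing import List, Sequence, Tuple


def make_windows(
    features: Sequence[Sequence[float]],
    releases: Sequence[float],
    lookback: int,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Turn daily records into (window, target) pairs for a sequence model.

    features[t] is assumed to hold the day-t inputs, e.g.
    [inflow, storage, precipitation, temperature]; releases[t] is the
    release observed on day t. Both layouts are hypothetical.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    if len(features) != len(releases):
        raise ValueError("features and releases must have the same length")

    windows: List[List[List[float]]] = []
    targets: List[float] = []
    for t in range(lookback, len(features)):
        # Inputs from days t-lookback .. t-1; the target is the day-t release.
        windows.append([list(row) for row in features[t - lookback:t]])
        targets.append(releases[t])
    return windows, targets


# Five days of made-up data with four variables per day.
daily = [[float(day)] * 4 for day in range(5)]
X, y = make_windows(daily, [0.0, 1.0, 2.0, 3.0, 4.0], lookback=3)
print(len(X), len(X[0]), len(X[0][0]))  # 2 3 4
```

The resulting nested list has the (samples, time steps, variables) layout that recurrent layers in common deep learning frameworks typically expect; a longer `lookback` lets the model see more history at the cost of fewer training samples.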
Making Sense of Sentiments for Aesthetic Plastic Surgery
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00061
A. Choudhary, E. Cambria
With social media pervading all aspects of our lives, the opinions expressed by netizens are a gold mine ready to be exploited in a meaningful way to influence all major public domains. Sentiment analysis is a way to interpret this unstructured data using AI tools. It is well known that the COVID-19 pandemic triggered a 'Zoom Boom' in the field of aesthetic plastic surgery and sharply focused attention on our appearance. Polarity detection of tweets published on popular aesthetic plastic surgery procedures before and after the onset of COVID can provide great insights for aesthetic plastic surgeons and the health industry at large. In this work, we develop an end-to-end system for the sentiment analysis of such tweets, incorporating a state-of-the-art fine-tuned deep learning model, an ingenious 'keyword search and filter approach', and SenticNet. Our system was tested on a large database of 196,900 tweets; the results were visualized using affectively correct word clouds and subjected to rigorous statistical hypothesis testing to draw meaningful inferences. The results showed a high level of statistical significance.
Citations: 0
Identify malfunctions and their possible causes using rules, application to process mining
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00023
Benoit Vuillemin, F. Bertrand
In the field of process mining, malfunction analysis is a major research domain. The goal is to find failures or relatively large processing delays and their possible causes. This paper presents an innovative research paradigm for process mining: prediction rule mining. Through a three-step method and two new algorithms, all observed cases of a process are decomposed into rules, the information in those rules is analyzed, and possible causes are searched for. This method provides information about the data, from its internal structure to the possible causes of failures, without requiring a priori knowledge about them.
Citations: 0
Deep-SHEEP: Sense of Humor Extraction from Embeddings in the Personalized Context
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00125
Julita Bielaniewicz, Kamil Kanclerz, P. Milkowski, Marcin Gruza, Konrad Karanowski, Przemyslaw Kazienko, Jan Kocoń
As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Due to its subjective nature, recognizing humor in NLP is a very challenging task. Here, we present a new, personalized approach to the task of predicting humor in text. It takes into account both the text and the context of the content receiver. For that purpose, we propose four Deep-SHEEP learning models that take advantage of user preference information in different ways. The experiments were conducted on four datasets: Cockamamie, HUMOR, Jester, and Humicroedit. The results show that applying an innovative personalized approach and a user-centric perspective significantly improves performance compared to generalized methods. Moreover, even for random text embeddings, our personalized methods outperform the generalized ones in the subjective humor modeling task. We also argue that the user-related data reflecting an individual sense of humor is of similar importance to the evaluated text itself. Different types of humor were investigated as well.
Citations: 5
An application of Customer Embedding for Clustering
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00019
Ahmet Tugrul Bayrak
Effective and powerful strategic planning in a competitive business environment brings businesses to the fore. For a business to grow, it is important to put the customer at the center by acting more intelligently when planning marketing and sales activities. To find customer behavior patterns, clustering models from machine learning can yield effective results. In this study, traditional customer clustering methods are enriched by using customer representations as features. To achieve this, a natural language processing method, word embedding, is applied to customers. By using the powerful mechanism of word embedding methods, a customer space is created in which customers are represented based on the products they have bought. It is observed that appending customer embeddings for customer clustering has a positive effect, and the results seem promising for further studies.
Citations: 0
Linguistic Knowledge Application to Neuro-Symbolic Transformers in Sentiment Analysis
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00059
Joanna Baran, Jan Kocoń
Neuro-symbolic approaches explore ways to combine neural networks with traditional symbolic knowledge. These methods are gaining attention due to their efficiency and their need for less data compared to currently used deep models. This work investigated several neuro-symbolic models for sentiment analysis, focusing on a variety of ways to add linguistic knowledge to the transformer-based architecture. English and Polish WordNets were used as a knowledge source with their polarity extensions (SentiWordNet, plWordNet Emo). The neuro-symbolic methods using knowledge during fine-tuning were neither better nor worse than the baseline model. However, a statistically significant gain of about three percentage points in F1-macro was obtained for the SentiLARE model, which applied domain data (word sentiment labels) already at the pretraining stage. The gain was most visible for medium-sized training sets. Therefore, developing an effective neuro-symbolic model is not trivial. The conclusions drawn from this work indicate a further need for a detailed study of these approaches, especially in natural language processing. In the context of sentiment classification, it could help design more efficient AI systems that can be deployed in business or marketing.
Citations: 2
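The abstract above reports its gain in F1-macro, i.e., the unweighted mean of the per-class F1 scores, which treats rare and frequent sentiment classes equally. The sketch below computes that standard metric from scratch; it is not taken from the paper, and the function name `macro_f1` and the example labels are hypothetical.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

In this toy case the per-class scores are 2/3 (pos), 2/3 (neg), and 1 (neu), so a difference of three percentage points would correspond to a change of 0.03 in the returned value.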
Using spatial data and cluster analysis to automatically detect non-trivial relationships between environmental transgressors
Pub Date : 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00022
José Alberto Sousa Torres, Paulo Henrique dos Santos, Daniel Alves da Silva, C. E. L. Veiga, Márcio Bastos Medeiros, Guilherme Fay Verqara, Fábio L. L. Mendonça, Rafael Timóteo de Sousa Júnior
The Amazon Rainforest is the most significant biodiversity reserve on the planet. It plays a central role in combating global warming and climate change. Despite its importance, 2021 was the worst year in a decade for illegal deforestation in the Brazilian Amazon rainforest. The data show that more than 10,000 kilometers of native forest were destroyed that year, an increase of 29% compared to 2020. To fight deforesters, Brazilian environmental inspection agencies have imposed more than 14 billion dollars in environmental fines in recent decades. However, this has not effectively reduced deforestation, as only 4% of this amount was actually collected, which does not deter lawbreakers from deforesting. This is due to the difficulty of identifying the real transgressors, who use scapegoats to hide their crimes. The main objective of this paper is to propose an approach to find the real environmental transgressors through the analysis of data related to the fines imposed by Brazilian governmental agencies in the last three decades. We propose a method that employs clustering techniques on geographic and temporal data extracted from fines to identify non-trivial correlations between scapegoats and large landowners. The automatically identified links were loaded into a graph analysis database for accuracy assessment. The observed results were positive and indicated that this strategy could effectively identify the real culprits.
Citations: 0
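The environmental-fines abstract above describes clustering geographic and temporal data from fines to surface links between fined parties. A minimal sketch of that general idea follows, and it is not the authors' implementation: the single-linkage rule, the thresholds `eps_km` and `eps_days`, the record layout, and the toy records are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_fines(fines, eps_km=5.0, eps_days=365):
    """Single-linkage clustering (assumed rule): two fines are linked when
    they are within eps_km of each other AND within eps_days in time."""
    parent = list(range(len(fines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(fines)), 2):
        _, lat_i, lon_i, day_i = fines[i]
        _, lat_j, lon_j, day_j = fines[j]
        close_in_time = abs(day_i - day_j) <= eps_days
        if close_in_time and haversine_km(lat_i, lon_i, lat_j, lon_j) <= eps_km:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(fines)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def co_occurrence(fines, clusters):
    """Count how often two distinct fined parties share a cluster."""
    pairs = Counter()
    for members in clusters:
        names = sorted({fines[i][0] for i in members})
        pairs.update(combinations(names, 2))
    return pairs


# Synthetic toy records (hypothetical): (fined party, latitude, longitude, day index)
FINES = [
    ("person_A", -3.100, -60.020, 10),
    ("owner_X", -3.105, -60.025, 40),
    ("person_A", -3.400, -61.000, 900),
    ("owner_X", -3.402, -61.003, 950),
    ("person_B", -9.000, -55.000, 20),
]

links = co_occurrence(FINES, cluster_fines(FINES))
print(links.most_common())
```

Pairs that recur across several separate clusters are the candidates one might load into a graph database for manual review, as the abstract describes; a density-based method such as DBSCAN could replace the single-linkage rule used here.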
Interpreting Categorical Data Classifiers using Explanation-based Locality
Pub Date: 2022-11-01 DOI: 10.1109/ICDMW58026.2022.00030
P. Rasouli, Ingrid Chieh Yu, E. Jiménez-Ruiz
Local surrogate explanation methods are a popular class of post-hoc interpretability approaches that explain the rationale of machine learning models in the locality of every particular instance. Fidelity, the accuracy with which an explanation method imitates the actual behavior of a model, is highly affected by the method's strategy for identifying the locality of instances. To find the locality of an instance, we need to calculate the distance between the instance and perturbed data points over both categorical and numerical features. While the distance of numerical features can be measured precisely, existing work usually adopts a coarse-grained or imprecise approach for comparing categorical features. This is especially problematic in the categorical data setting, where defining a representative locality demands fine-grained semantic similarity information between categories. In this paper, we propose a locality generation approach for categorical data classifiers that makes no assumption about domain knowledge and infers categorical similarities by relying on the model's explanations. Further, we devise a multi-centered sampling approach based on the derived similarity information that, compared to the conventional instance-centered technique, captures the local behavior of the model more effectively. Moreover, we develop a knowledge-based locality generation approach based on knowledge graphs to benchmark our explanation-based method against a scenario where the similarity information is provided by a domain expert. The experiments conducted on various data sets demonstrate the efficacy of our approach in generating faithful explanations.
Citations: 0
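The explanation-based locality abstract above contrasts coarse categorical comparison with fine-grained similarity between categories when measuring the distance from an instance to its perturbed neighbours. The sketch below illustrates that contrast with a Gower-style mixed distance. It is an assumption-laden illustration, not the paper's method: the function `mixed_distance`, the feature layout, and the hand-set similarity value are hypothetical, whereas the paper infers similarities from the model's explanations.

```python
def mixed_distance(x, z, num_idx, cat_idx, num_range, cat_sim=None):
    """Gower-style distance in [0, 1] between instances x and z.

    num_idx / cat_idx: positions of numerical / categorical features.
    num_range: feature index -> value range used to scale numeric gaps.
    cat_sim: optional feature index -> {frozenset({a, b}): similarity in [0, 1]}.
             When omitted, differing categories count as fully dissimilar
             (the coarse overlap measure).
    """
    total = 0.0
    for i in num_idx:
        total += abs(x[i] - z[i]) / num_range[i]
    for i in cat_idx:
        if x[i] == z[i]:
            continue  # identical categories contribute no distance
        if cat_sim is None:
            total += 1.0
        else:
            pair = frozenset((x[i], z[i]))
            total += 1.0 - cat_sim.get(i, {}).get(pair, 0.0)
    return total / (len(num_idx) + len(cat_idx))


# Hypothetical instance and perturbations: (age, occupation)
instance = (35, "nurse")
neighbours = [(35, "doctor"), (35, "lawyer")]
ranges = {0: 50.0}
# Hand-set similarity for illustration only
similarity = {1: {frozenset(("nurse", "doctor")): 0.8}}

coarse = [mixed_distance(instance, z, [0], [1], ranges) for z in neighbours]
fine = [mixed_distance(instance, z, [0], [1], ranges, similarity) for z in neighbours]
print("coarse:", coarse)
print("fine:  ", fine)
```

Under the coarse measure both perturbed points are equally far from the instance, while the similarity-aware measure places the semantically closer category nearer, which changes which samples fall inside the locality used to fit a surrogate.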