Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00010
Niklas Strauß, Max Berrendorf, Tom Haider, M. Schubert
Modern Emergency Medical Services (EMS) benefit from real-time sensor information in various ways, as it provides up-to-date location information and helps assess current local emergency risks. A critical part of EMS is dynamic ambulance redeployment, i.e., the task of assigning idle ambulances to base stations throughout a community. Although considerable effort has gone into methods for optimizing emergency response systems, comparing proposed methods is generally difficult because reported results are mostly based on artificial and proprietary test beds. In this paper, we present a benchmark simulation environment for dynamic ambulance redeployment based on real emergency data from the city of San Francisco. Our proposed simulation environment is highly scalable and compatible with modern reinforcement learning frameworks. We provide a comparative study of several state-of-the-art methods across various metrics. Results indicate that even simple baseline algorithms can perform remarkably well in close-to-realistic settings. The code of our simulator is openly available at https://github.com/niklasdbs/ambusim.
Title: A Comparison of Ambulance Redeployment Systems on Real-World Data
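To make the redeployment task concrete, here is a minimal gym-style environment sketch. All names and dynamics are illustrative assumptions, not the authors' simulator (which is at the repository above): stations sit on a line, an emergency appears at a random station each step, and the reward is the negative distance from the emergency to the nearest ambulance.

```python
import random

class AmbulanceRedeploymentEnv:
    """Toy sketch of a dynamic ambulance redeployment environment.
    Hypothetical dynamics for illustration only."""

    def __init__(self, n_stations=5, n_ambulances=3, seed=0):
        self.n_stations = n_stations
        self.n_ambulances = n_ambulances
        self.rng = random.Random(seed)

    def reset(self):
        # every ambulance starts idle at station 0
        self.positions = [0] * self.n_ambulances
        self.t = 0
        return tuple(self.positions)

    def step(self, action):
        """action: station index assigned to the next idle ambulance."""
        amb = self.t % self.n_ambulances
        self.positions[amb] = action
        self.t += 1
        # reward: negative distance from a random emergency to the nearest ambulance
        emergency = self.rng.randrange(self.n_stations)
        response = min(abs(p - emergency) for p in self.positions)
        reward = -response
        done = self.t >= 100
        return tuple(self.positions), reward, done, {}
```

A baseline policy is then just a loop that calls `step` with, e.g., a random or round-robin station choice.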
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00038
Rodrigo Fay Vergara, Paulo Henrique dos Santos, Guilherme Fay Vergara, Fábio L. L. Mendonça, C. E. L. Veiga, B. Praciano, Daniel Alves da Silva, Rafael Timóteo de Sousa Júnior
This article presents a study of an automatic speech recognition system in Portuguese applied to videos from the General Attorney of the Union of Brazil. Because the videos are confidential, using proprietary software from large companies is not allowed for security reasons. Thus, constructing an artificial intelligence model capable of performing automatic speech recognition in Portuguese in the judicial context, and making this model available for large-scale inference, is critical to maintaining data security. For this purpose, a dataset in Brazilian Portuguese was built by combining three existing datasets. The system used the Jasper and QuartzNet TDNN architectures for network training, obtaining promising preliminary results with a word error rate (WER) of 56% without a language model.
Title: A study of automatic speech recognition in Portuguese by the Brazilian General Attorney of the Union
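The WER figure reported above is the standard word-level edit distance normalized by reference length. A minimal self-contained implementation (not the authors' code) looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

A WER of 56% means roughly one word error per two reference words, which is why the authors describe the result as preliminary.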
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00143
Ming Fan, Lujun Zhang, Siyan Liu, Tiantian Yang, Dawei Lu
Simulation of reservoir releases plays a critical role in socio-economic functioning and national security. However, it is challenging to predict reservoir releases accurately because of the many influential factors from natural environments and engineering controls, such as reservoir inflow and storage. Moreover, climate change and hydrological intensification, which cause extreme precipitation and temperature, make accurate prediction of reservoir releases even more challenging. Machine learning (ML) methods have shown some successful applications in simulating reservoir releases. However, previous studies mainly used inflow and storage data as inputs and only considered their short-term influences (e.g., the previous one or two days). In this work, we use long short-term memory (LSTM) networks to predict reservoir releases from four input variables (inflow, storage, precipitation, and temperature) and consider their long-term influences. We apply the LSTM model to 30 reservoirs in the Upper Colorado River Basin, United States. We analyze the prediction performance using six statistical metrics. More importantly, we investigate the influence of the input hydrometeorological factors, as well as their temporal effects, on reservoir release decisions. Results indicate that inflow and storage are the most influential factors, but including precipitation and temperature can further improve release prediction, especially at low flows. Additionally, inflow and storage have a relatively long-term effect on the release. These findings can help optimize water resources management in reservoirs.
Title: Identifying Hydrometeorological Factors Influencing Reservoir Releases Using Machine Learning Methods
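The "long-term influence" setup above amounts to feeding the LSTM a lagged window of the four drivers for each prediction day. A sketch of that data preparation (window length and variable names are assumptions for illustration):

```python
def make_sequences(inflow, storage, precip, temp, release, window=30):
    """Build (X, y) samples where each X holds the previous `window` days
    of the four drivers and y is the release on the following day."""
    features = list(zip(inflow, storage, precip, temp))
    X, y = [], []
    for t in range(window, len(release)):
        X.append(features[t - window:t])  # window x 4 lagged inputs
        y.append(release[t])              # next-day release target
    return X, y
```

Each `X[i]` is a `window x 4` sequence ready to be consumed by an LSTM; varying `window` is one way to probe how far back inflow and storage still affect the release.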
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00061
A. Choudhary, E. Cambria
With social media pervading all aspects of our lives, the opinions expressed by netizens are a gold mine that can be meaningfully exploited across all major public domains. Sentiment analysis is a way to interpret this unstructured data using AI tools. It is well known that the COVID-19 pandemic triggered a 'Zoom Boom' in aesthetic plastic surgery, as it sharply focused attention on our appearance. Polarity detection of tweets about popular aesthetic plastic surgery procedures before and after the onset of COVID-19 can provide valuable insights for aesthetic plastic surgeons and the health industry at large. In this work, we develop an end-to-end system for the sentiment analysis of such tweets, incorporating a state-of-the-art fine-tuned deep learning model, an ingenious 'keyword search and filter' approach, and SenticNet. Our system was tested on a large database of 196,900 tweets; the results were visualized using affectively correct word clouds and subjected to rigorous statistical hypothesis testing to draw meaningful inferences. The results showed a high level of statistical significance.
Title: Making Sense of Sentiments for Aesthetic Plastic Surgery
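The 'keyword search and filter' step can be pictured as below. The procedure list and polarity lexicon here are purely illustrative stand-ins; the paper uses a fine-tuned deep model and SenticNet rather than word counting.

```python
PROCEDURES = {"rhinoplasty", "botox", "facelift", "liposuction"}  # illustrative
POSITIVE = {"love", "amazing", "happy", "great"}                  # toy lexicon
NEGATIVE = {"regret", "pain", "botched", "awful"}

def filter_and_score(tweets):
    """Keep tweets mentioning a procedure, then assign a naive polarity
    by counting lexicon hits (a stand-in for the real classifier)."""
    results = []
    for text in tweets:
        words = set(text.lower().split())
        if words & PROCEDURES:  # keyword filter
            score = len(words & POSITIVE) - len(words & NEGATIVE)
            label = ("positive" if score > 0
                     else "negative" if score < 0 else "neutral")
            results.append((text, label))
    return results
```

Filtering first keeps the downstream model focused on on-topic tweets, which matters at the scale of nearly 200k posts.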
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00023
Benoit Vuillemin, F. Bertrand
In the field of process mining, malfunction analysis is a major research domain. The goal is to find failures or relatively large processing delays and their possible causes. This paper presents an innovative research paradigm for process mining: prediction rule mining. Through a three-step method and two new algorithms, all observed cases of a process are decomposed into rules, whose information is analyzed and whose possible causes are searched for. This method provides information about the data, from its internal structure to the possible causes of failures, without requiring a priori knowledge about them.
Title: Identify malfunctions and their possible causes using rules, application to process mining
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00125
Julita Bielaniewicz, Kamil Kanclerz, P. Milkowski, Marcin Gruza, Konrad Karanowski, Przemyslaw Kazienko, Jan Kocoń
As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Due to its subjective nature, recognizing humor is a very challenging task in NLP. Here, we present a new approach to the task of predicting humor in text by applying a personalized approach that takes into account both the text and the context of the content receiver. For that purpose, we propose four Deep-SHEEP learning models that take advantage of user preference information in different ways. The experiments were conducted on four datasets: Cockamamie, HUMOR, Jester, and Humicroedit. The results show that applying an innovative personalized approach and user-centric perspective significantly improves performance compared to generalized methods. Moreover, even for random text embeddings, our personalized methods outperform the generalized ones in the subjective humor modeling task. We also argue that user-related data reflecting an individual sense of humor is as important as the evaluated text itself. Different types of humor were investigated as well.
Title: Deep-SHEEP: Sense of Humor Extraction from Embeddings in the Personalized Context
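One simple way to condition a humor classifier on the receiver, consistent with the personalized setup described above, is to concatenate a user preference vector with the text representation before the classifier head. This sketch is a toy linear stand-in; the actual Deep-SHEEP architectures differ.

```python
def personalized_features(text_vec, user_vec):
    """Concatenate a text representation with the receiver's preference
    vector so the classifier can condition on the individual user."""
    return list(text_vec) + list(user_vec)

def predict_funny(features, weights, bias=0.0):
    """Toy linear scorer standing in for a learned classifier head."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return score > 0
```

With this framing, two users with different preference vectors can receive different predictions for the same text, which is exactly the behavior a generalized model cannot express.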
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00019
Ahmet Tugrul Bayrak
Effective and powerful strategic planning in a competitive business environment brings businesses to the fore. Placing the customer at the center, by acting more intelligently when planning marketing and sales activities, is important for business growth. To find customer behavior patterns, clustering models from machine learning can yield effective results. In this study, traditional customer clustering methods are enriched by using customer representations as features. To achieve that, a natural language processing method, word embedding, is applied to customers. Using the powerful mechanism of word embedding methods, a customer space is created in which customers are represented based on the products they have bought. It is observed that appending customer embeddings for customer clustering has a positive effect, and the results seem promising for further studies.
Title: An application of Customer Embedding for Clustering
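The idea above can be sketched end to end: given product embeddings (in practice learned word2vec-style from purchase "sentences"; here assumed to exist), a customer vector is the average of the products they bought, and those vectors are then clustered. The tiny k-means below is for illustration only; real code would use a library implementation.

```python
import random

def customer_vector(purchases, product_emb):
    """Average the embeddings of the products a customer bought,
    analogous to averaging word vectors into a sentence vector."""
    dim = len(next(iter(product_emb.values())))
    vec = [0.0] * dim
    for p in purchases:
        for i, v in enumerate(product_emb[p]):
            vec[i] += v
    return [v / len(purchases) for v in vec]

def kmeans(points, k, iters=10, seed=0):
    """Tiny k-means for illustration (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters
```

The payoff of the embedding step is that customers who buy similar (not identical) products end up close together, something one-hot purchase features cannot capture.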
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00059
Joanna Baran, Jan Kocoń
Neuro-symbolic approaches explore ways to combine neural networks with traditional symbolic knowledge. These methods are gaining attention due to their efficiency and their need for less data compared to currently used deep models. This work investigated several neuro-symbolic models for sentiment analysis, focusing on a variety of ways to add linguistic knowledge to a transformer-based architecture. English and Polish WordNets were used as a knowledge source, together with their polarity extensions (SentiWordNet, plWordNet Emo). The neuro-symbolic methods using knowledge during fine-tuning were neither better nor worse than the baseline model. However, a statistically significant gain of about three percentage points in F1-macro was obtained for the SentiLARE model, which applied domain data (word sentiment labels) already at the pretraining stage. The gain was most visible for medium-sized training sets. Therefore, developing an effective neuro-symbolic model is not trivial. The conclusions drawn from this work indicate a further need for a detailed study of these approaches, especially in natural language processing. In the context of sentiment classification, it could help design more efficient AI systems that can be deployed in business or marketing.
Title: Linguistic Knowledge Application to Neuro-Symbolic Transformers in Sentiment Analysis
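The word-level sentiment labels mentioned above can be pictured as a token annotation step: each token is paired with a polarity drawn from a lexicon before being fed to the model. The mini lexicon here is a hypothetical stand-in for SentiWordNet / plWordNet Emo.

```python
# Hypothetical mini polarity lexicon standing in for SentiWordNet / plWordNet Emo.
LEXICON = {"good": 1, "excellent": 1, "bad": -1, "terrible": -1}

def annotate_with_polarity(tokens):
    """Pair each token with its lexicon polarity (0 when unknown); this is
    the kind of word-level signal SentiLARE injects during pretraining."""
    return [(tok, LEXICON.get(tok.lower(), 0)) for tok in tokens]
```

The study's finding is that injecting such labels at pretraining time helped, while adding the same knowledge only during fine-tuning did not.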
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00022
José Alberto Sousa Torres, Paulo Henrique dos Santos, Daniel Alves da Silva, C. E. L. Veiga, Márcio Bastos Medeiros, Guilherme Fay Vergara, Fábio L. L. Mendonça, Rafael Timóteo de Sousa Júnior
The Amazon Rainforest is the most significant biodiversity reserve on the planet. It plays a central role in combating global warming and climate change on Earth. Despite its importance, 2021 was the worst year in a decade for illegal deforestation in the Brazilian Amazon rainforest. The data show that more than 10,000 square kilometers of native forest were destroyed that year, an increase of 29% compared to 2020. To fight the actions of deforesters, Brazilian environmental inspection agencies have imposed more than 14 billion dollars in environmental fines in recent decades. However, this has not effectively reduced deforestation, as only 4% of this amount was actually collected, which fails to deter lawbreakers from deforesting. This is due to the difficulty of identifying the real transgressors, who use scapegoats to hide their crimes. The main objective of this paper is to propose an approach to find the real environmental transgressors through the analysis of data related to the fines imposed by Brazilian governmental agencies over the last three decades. We propose a method that employs clustering techniques on geographic and temporal data extracted from fines to identify non-trivial correlations between scapegoats and large landowners. The automatically identified links were loaded into a graph analysis database for accuracy assessment.
The observed results were positive and indicated that this strategy could effectively identify the real culprits.
Title: Using spatial data and cluster analysis to automatically detect non-trivial relationships between environmental transgressors
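The core of the geo-temporal linking can be sketched as pairing fines that are close in both space and time; such pairs are the raw edges later loaded into a graph database. Thresholds and record fields below are illustrative assumptions, not the paper's parameters.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def linked_fines(fines, max_km=10.0, max_days=30):
    """Pair fines issued within max_km and max_days of each other;
    each fine is a dict with 'id', 'lat', 'lon', and 'day' keys."""
    pairs = []
    for i in range(len(fines)):
        for j in range(i + 1, len(fines)):
            a, b = fines[i], fines[j]
            if (haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
                    and abs(a["day"] - b["day"]) <= max_days):
                pairs.append((a["id"], b["id"]))
    return pairs
```

In graph form, a scapegoat repeatedly fined near a landowner's properties shows up as a densely connected pair of nodes, which is the non-trivial correlation the method looks for.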
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00030
P. Rasouli, Ingrid Chieh Yu, E. Jiménez-Ruiz
Local surrogate explanation methods are a popular class of post-hoc interpretability approaches that explain the rationale of machine learning models in the locality of every particular instance. Fidelity, which refers to the accuracy of explanation methods in imitating the actual behavior of a model, is highly affected by their strategy for identifying the locality of instances. To find the locality of an instance, we need to calculate the distance between the instance and perturbed data points concerning categorical and numerical features. While the distance of numerical features can be measured precisely, the existing works usually adopt a coarse-grained or imprecise approach for comparing categorical features. This is especially problematic in the categorical data setting, where defining a representative locality demands fine-grained semantic similarity information between categories. In this paper, we propose a locality generation approach for categorical data classifiers that makes no assumption about domain knowledge and infers categorical similarities by relying on the model's explanations. Further, we devise a multi-centered sampling approach based on the derived similarity information that, compared to the conventional instance-centered technique, captures the local behavior of the model more effectively. Moreover, we develop a knowledge-based locality generation approach based on knowledge graphs to benchmark our explanation-based method against a scenario where the similarity information is provided by a domain expert. The experiments conducted on various data sets demonstrate the efficacy of our approach in generating faithful explanations.
Title: Interpreting Categorical Data Classifiers using Explanation-based Locality
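The role of fine-grained categorical similarity in the distance computation can be illustrated as follows. In this sketch the similarity table is given explicitly; in the paper it would be inferred from the model's explanations. Function and parameter names are assumptions for illustration.

```python
def mixed_distance(x, y, num_idx, cat_sim):
    """Distance between two instances with numerical and categorical
    features. cat_sim[(a, b)] is a similarity in [0, 1] between two
    category values; dissimilarity is 1 - similarity."""
    d = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in num_idx:
            d += (a - b) ** 2                 # numerical: squared difference
        else:
            sim = (1.0 if a == b
                   else cat_sim.get((a, b), cat_sim.get((b, a), 0.0)))
            d += (1.0 - sim) ** 2             # categorical: graded mismatch
    return d ** 0.5
```

Compared with the coarse exact-match rule (distance 0 or 1 per categorical feature), the graded term lets semantically close categories such as "red" and "crimson" stay inside the same locality.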