Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00009
Lu Jiang, Y. Li, Na Luo, Jianan Wang, Qiao Ning
With the development of technology and the sharing economy, Airbnb, a well-known short-term rental platform, has become a first choice for many young people. Airbnb pricing has long been a problem worth studying. While previous studies achieve promising results, deficiencies remain: (1) the feature attributes of rentals are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies that predict the rental price using the points of interest (POI) around the house. To address these challenges, we propose a multi-source information embedding (MSIE) model to predict the rental price of Airbnb listings. Specifically, we first select statistical features to embed the original rental data. Secondly, we generate word feature vectors and emotional scores from three different kinds of text information to form the text feature embedding. Thirdly, we use the points of interest (POI) around the rental house to generate a variety of spatial network graphs and learn network embeddings to obtain the spatial feature embedding. Finally, we combine the three modules into a multi-source rental representation and use a fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
{"title":"A Multi-Source Information Learning Framework for Airbnb Price Prediction","authors":"Lu Jiang, Y. Li, Na Luo, Jianan Wang, Qiao Ning","doi":"10.1109/ICDMW58026.2022.00009","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00009","url":null,"abstract":"With the development of technology and sharing economy, Airbnb as a famous short-term rental platform, has become the first choice for many young people to select. The issue of Airbnb's pricing has always been a problem worth studying. While the previous studies achieve promising results, there are exists deficiencies to solve. Such as, (1) the feature attributes of rental are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price combined with the point of interest(POI) around the house. To address the above challenges, we proposes a multi-source information embedding(MSIE) model to predict the rental price of Airbnb. Specifically, we first selects the statistical feature to embed the original rental data. Secondly, we generates the word feature vector and emotional score combination of three different text information to form the text feature embedding. Thirdly, we uses the points of interest(POI) around the rental house information generates a variety of spatial network graphs, and learns the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi source rental representations, and uses the constructed fully connected neural network to predict the price. 
The analysis of the experimental results shows the effectiveness of our proposed model.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130213241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
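The multi-source combination the abstract describes, concatenating statistical, text, and spatial embeddings and feeding them to a fully connected regressor, can be sketched roughly as follows. All dimensions and weights here are invented for illustration; this is not the authors' implementation, and the weights would be learned against listed prices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-listing embeddings from the three modules
# (dimensions are illustrative, not taken from the paper).
stat_emb = rng.normal(size=(4, 8))     # statistical features of 4 listings
text_emb = rng.normal(size=(4, 16))    # word vectors + emotional scores
spatial_emb = rng.normal(size=(4, 8))  # POI spatial-graph embeddings

# Multi-source rental representation: concatenate the three modules.
x = np.concatenate([stat_emb, text_emb, spatial_emb], axis=1)  # shape (4, 32)

def relu(z):
    return np.maximum(z, 0.0)

# One forward pass of a small fully connected regressor
# (untrained random weights, purely to show the data flow).
w1 = rng.normal(size=(32, 16)); b1 = np.zeros(16)
w2 = rng.normal(size=(16, 1));  b2 = np.zeros(1)
price_pred = relu(x @ w1 + b1) @ w2 + b2   # one predicted price per listing
print(price_pred.shape)
```

In a real pipeline the concatenated representation and the regressor weights would be trained end to end on observed rental prices.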
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00053
Huanyi Zhou, Honggang Zhao, Wenlu Wang
Sparse computed tomography (CT) reconstruction can lead to significant streak artifacts. Image restoration that removes these artifacts while recovering image features is an important area of research in low-dose sparse CT imaging. In pre-clinical research, where a lag still exists in the use of professional CT equipment, existing imaging devices provide limited X-ray dose energy accompanied by strong noise patterns when scanning, so reconstructed CT images contain significant noise and artifacts. To address this issue, we propose a deep transfer learning (DTL) neural network training method that exploits open-source data for initial training and a small-scale detected phantom image, with its total variation result, for transfer learning. We hypothesize that a neural network pre-trained on open-source data has no prior knowledge of our device configuration, which prevents its application to our measured data, and that deep transfer learning on a small-scale detected phantom can feed the specific configuration into the model. Our experiment has demonstrated that our proposed method, incorporating a modified total variation (TV) algorithm, can successfully realize a good balance between artifact removal and image feature restoration.
{"title":"U-Net Transfer Learning for Image Restoration on Sparse CT Reconstruction in Pre-Clinical Research","authors":"Huanyi Zhou, Honggang Zhao, Wenlu Wang","doi":"10.1109/ICDMW58026.2022.00053","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00053","url":null,"abstract":"Sparse computed tomography (CT) reconstruction can lead to significant streak artifacts. Image restoration that removes these artifacts while recovering image features is an important area of research in low-dose sparse CT imaging. In pre-clinical research, where a lag still exists in the use of professional CT equipment, existing imaging devices provide limited X-ray dose energy accompanied by strong noise patterns when scanning. Reconstructed CT images contain significant noise and artifacts. We propose a deep transfer learning (DTL) neural network training method that exploits open-source data for initial training and a small-scale detected phantom image with its total variation result for transfer learning to address this issue. We hypothesize that a pre-trained neural network from open-source data has no prior knowledge of our device configuration, which prevents its application on our measured data, and deep transfer learning on small-scale detected phantom can feed specific configurations into the model. 
Our experiment has demonstrated that our proposed method, incorporating a modified total variation (TV) algorithm, can successfully realize a good balance between artifact removal and image feature restoration.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128594862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
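The total variation (TV) regularization the method builds on can be illustrated with a minimal 1-D denoising sketch. The paper applies a modified TV algorithm to 2-D CT slices; the signal, parameters, and smoothed gradient-descent solver below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Piecewise-constant "phantom" signal with additive noise.
clean = np.repeat([0.0, 1.0, 0.0], 40)
noisy = clean + 0.2 * rng.normal(size=clean.size)

def tv_denoise(y, lam=0.3, step=0.02, iters=3000, eps=1e-3):
    """Gradient descent on 0.5*||x - y||^2 + lam * sum sqrt((dx)^2 + eps)."""
    x = y.copy()
    for _ in range(iters):
        dx = np.diff(x)
        g = dx / np.sqrt(dx * dx + eps)   # derivative of smoothed |dx|
        # Chain rule spreads each difference's gradient to its two endpoints.
        tv_grad = np.concatenate([[0.0], g]) - np.concatenate([g, [0.0]])
        x -= step * ((x - y) + lam * tv_grad)
    return x

denoised = tv_denoise(noisy)
# TV penalization suppresses noise oscillations while preserving edges.
print(np.abs(np.diff(noisy)).sum(), np.abs(np.diff(denoised)).sum())
```

The TV term penalizes the total amount of oscillation, which is why it removes streak-like noise while keeping sharp anatomical edges, the balance the paper aims for.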
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00029
Seiji Okura, T. Mohri
Fairness in machine learning is a recently established research area concerned with mitigating the bias of unfair models that treat unprivileged people unfavorably based on protected attributes. We take an approach to mitigating such bias based on the idea of data segmentation, that is, dividing data into segments in which people should be treated similarly. Such an approach should be useful in the sense that the mitigation process itself is explainable for cases in which similar people should be treated similarly. Although research on such cases exists, the question of the effectiveness of data segmentation itself remains to be answered. In this paper, we answer this question by empirically analyzing data-segmentation experiments on two datasets, the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controlled during model training by the way data is divided into segments, more specifically, by selecting the attributes and setting the number of segments to adjust statistics such as the statistical parity of the segments and the mutual information between attributes; (2) the effects of data segmentation depend on the classifier; and (3) there are weak trade-offs between fairness and accuracy with regard to data segmentation.
{"title":"Empirical analysis of fairness-aware data segmentation","authors":"Seiji Okura, T. Mohri","doi":"10.1109/ICDMW58026.2022.00029","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00029","url":null,"abstract":"Fairness in machine learning is a research area that is recently established, for mitigating bias of unfair models that treat unprivileged people unfavorably based on protected attributes. We want to take an approach for mitigating such bias based on the idea of data segmentation, that is, dividing data into segments where people should be treated similarly. Such an approach should be useful in the sense that the mitigation process itself is explainable for cases that similar people should be treated similarly. Although research on such cases exists, the question of effectiveness of data segmentation itself, however, remains to be answered. In this paper, we answer this question by empirically analyzing the experimental results of data segmentation by using two datasets, i.e., the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controllable during training models by the way of dividing data into segments; more specifically, by selecting the attributes and setting the number of segments for adjusting statistics such as statistical parity of the segments and mutual information between the attributes, etc. 
(2) the effects of data segmentation is dependent on classifiers, and (3) there exist weak trade-offs between fairness and accuracy with regard to data segmentation.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
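The segment-level statistic the paper adjusts, statistical parity within each segment, can be computed along these lines. The records, attribute layout, and segmentation are invented; this is only a sketch of the idea, not the paper's pipeline.

```python
from collections import defaultdict

# Hypothetical records: (segment attribute, protected attribute, outcome).
records = [
    ("HS", "F", 0), ("HS", "M", 1), ("HS", "F", 1), ("HS", "M", 1),
    ("BSc", "F", 1), ("BSc", "M", 1), ("BSc", "F", 0), ("BSc", "M", 0),
]

def statistical_parity_by_segment(rows, seg_key, prot_key, label_key):
    """P(favorable | group M) - P(favorable | group F), per segment."""
    seg = defaultdict(lambda: {"M": [0, 0], "F": [0, 0]})  # [favorable, total]
    for r in rows:
        s = seg[r[seg_key]][r[prot_key]]
        s[0] += r[label_key]
        s[1] += 1
    out = {}
    for name, groups in seg.items():
        rate = {g: (fav / tot if tot else 0.0) for g, (fav, tot) in groups.items()}
        out[name] = rate["M"] - rate["F"]
    return out

gaps = statistical_parity_by_segment(records, 0, 1, 2)
print(gaps)  # parity gap per segment; 0.0 means equal favorable rates
```

Choosing which attributes define the segments (and how many segments to use) changes these per-segment gaps, which is the control knob result (1) above refers to.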
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00042
Mahmoud Seifallahi, J. Galvin, B. Ghoraani
Mild cognitive impairment (MCI) is abnormal cognitive decline beyond expected normal decline. The rate of progression to Alzheimer's disease (AD) in people with MCI is an estimated 80% within 6 years. However, distinguishing MCI from normal cognition in older adults remains a clinical challenge in early AD detection. We investigated a new method for detecting MCI based on patients' gait and balance. Our approach performs a comprehensive analysis of the Timed Up and Go (TUG) test, based on the first application of a Kinect v.2 camera to record movement measures, combined with machine learning to differentiate between two groups of older adults: those with MCI and healthy controls (HC). We collected movement data from 25 joints of the body via a Kinect v.2 camera as 30 HC and 25 MCI subjects performed TUG. The collected data provided a comprehensive list of 61 gait and balance features, including the duration of TUG, the duration and velocity of the transition phases, and micro and macro gait features. Our analysis showed that 25 features differed significantly between MCI and HC subjects, of which 20 were unique features as indicated by our correlation analysis. Classification using three different classifiers, support vector machine (SVM), random forest, and artificial neural network, showed that our approach detected MCI subjects best with SVM, at 94% accuracy, 100% precision, 93.33% F-score, and 0.94 AUC. These observations suggest the possibility of our approach as a low-cost, easy-to-use MCI screening tool for objectively detecting subjects at high risk of developing AD. Such a tool is well suited for widespread application in clinical settings and nursing homes to detect early signs of cognitive impairment and promote healthy aging.
{"title":"Detection of Mild Cognitive Impairment from Quantitative Analysis of Timed Up and Go (TUG)","authors":"Mahmoud Seifallahi, J. Galvin, B. Ghoraani","doi":"10.1109/ICDMW58026.2022.00042","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00042","url":null,"abstract":"Mild cognitive impairment (MCl) is abnormal cognitive decline beyond expected normal decline. The rate of progression to Alzheimer's disease (AD) in people with MCl is an estimated 80% in 6 years. However, identifying MCI from normal cognition in older adults remains a clinical challenge in early AD detection. We investigated a new method for detecting MCI based on patients' gait and balance. Our approach performs a comprehensive analysis of the Timed Up and Go test (TUG), based on the first application of a Kinect v.2 camera to record and provide movement measures and machine learning to differentiate between the two groups of older adults with MCI and healthy controls (HC). We collected movement data from 25 joints of the body via a Kinect v.2 camera as 30 HC and 25 MCI subjects performed TUG. The collected data provided a comprehensive list of gait and balance measures with 61 features, including duration of TUG, duration and velocity of transition phases, and micro and macro gait features. Our analysis evidenced that 25 features were significantly different between MCI and HC subjects, where 20 of them were unique features as indicated by our correlation analysis. The classification results using three different classifiers of support vector machine (SVM), random forest, and artificial neural network showed that the ability of our approach for detecting MCI subjects with the highest performance was using SVM with 94% accuracy, 100 % precision, 93.33% F-score, and 0.94 AUC. These observations suggest the possibility of our approach as a low-cost, easy-to-use MCI screening tool for objectively detecting subjects at high risk of developing AD. 
Such a tool is well-suited for widespread application in clinical settings and nursing homes to detect early signs of cognitive impairment and promote healthy aging.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125740847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
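Screening features for significant group differences, as in the 25-feature result above, might look like the following Welch's t-test sketch. The feature values are made up, not the study's measurements, and the paper does not specify which significance test it used.

```python
import math
from statistics import mean, variance

# Illustrative values of a single gait feature (e.g., TUG duration in
# seconds) for the two groups; numbers are invented.
hc  = [8.1, 7.9, 8.4, 8.0, 7.7, 8.2, 8.3, 7.8]
mci = [10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7, 10.0]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

t = welch_t(hc, mci)
print(abs(t))  # a large |t| flags the feature as discriminative
```

Features passing such a screen (and surviving a correlation-based redundancy check, as in the paper) would then feed the SVM, random forest, or neural network classifiers.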
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00024
Jianing Zhao, Ayana Takai, E. Kita
An ensemble model is applied to stock price prediction in this study. The proposed ensemble model is based on a weighted average of the values predicted by the base algorithms: Linear Regression, Long Short-Term Memory (LSTM), Support Vector Regression (SVR), and LightGBM. The performance of the proposed model depends on the weight parameters, which are calculated from past data for the base models. The stock price prediction of Toyota Motor Corporation is used as the numerical example. LSTM, SVR, and LightGBM models are then built to recognize the trend of the weight sequence data and to predict the most suitable combination weights for the ensemble. The experimental results show that every ensemble model achieves significantly better accuracy than each component model. The proposed model also achieved lower error than the simple-average and error-based combination methods. Even a tiny difference in the choice of combination weights can play a crucial role in a linear combination of models for prediction.
{"title":"Weight-Training Ensemble Model for Stock Price Forecast","authors":"Jianing Zhao, Ayana Takai, E. Kita","doi":"10.1109/ICDMW58026.2022.00024","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00024","url":null,"abstract":"The ensemble model is applied for the stock price prediction in this study. The proposed ensemble model is based on the weighted average estimation of the values predicted by base algorithms. The base algorithms include Linear Regression, Long Short-Term Memory (LSTM), Support Vector Regression (SVR) and lightGBM. The performance of the proposed model depends on the weight parameters. The past data are collected to calculate the weigh parameters for base models of the ensemble models. The stock price prediction of Toyota Motor Corporation is considered as the numerical examples. Then LSTM, SVR and LightGBM are built to recognize the trend of the weight sequence data and to predict the most suitable combination weights for ensemble. The experimental results show that any ensemble models achieves significantly better accuracy than each component model. The proposed model also achieved the lowest error than simple average and error-based combination method. Even a tiny difference in choosing associated combining weights can play a crucial role in linear combination of models for prediction.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128503683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00033
Chaekyu Lee, Jaekwang Kim
In this paper, we propose the star topology and rejection method (STARM), a new oversampling technique that generally performs well for varying data and algorithms. STARM is a hybrid technique that combines the advantages of Polynom-fit-SMOTE, LEE, and SMOTE, all of which have yielded high performance based on different technical features, and eliminates their disadvantages. To verify that the proposed technique exhibits high performance in general situations, we conducted 28,028 experiments to compare the predictive performance of 77 oversampling techniques with four machine learning algorithms for 91 imbalanced datasets of various types. Consequently, STARM yielded the highest performance on average among the 77 techniques. In addition, it showed excellent performance even in various algorithms, various imbalanced ratios, and various data volumes.
{"title":"Hybrid Oversampling Technique Based on Star Topology and Rejection Methodology for Classifying Imbalanced Data","authors":"Chaekyu Lee, Jaekwang Kim","doi":"10.1109/ICDMW58026.2022.00033","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00033","url":null,"abstract":"In this paper, we propose the star topology and rejection method (STARM), a new oversampling technique that generally performs well for varying data and algorithms. STARM is a hybrid technique that combines the advantages of Polynom-fit-SMOTE, LEE, and SMOTE, all of which have yielded high performance based on different technical features, and eliminates their disadvantages. To verify that the proposed technique exhibits high performance in general situations, we conducted 28,028 experiments to compare the predictive performance of 77 oversampling techniques with four machine learning algorithms for 91 imbalanced datasets of various types. Consequently, STARM yielded the highest performance on average among the 77 techniques. In addition, it showed excellent performance even in various algorithms, various imbalanced ratios, and various data volumes.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126988768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00111
M. Mainsant, M. Mermillod, C. Godin, M. Reyboz
Continual learning is one of the major challenges of deep learning. For decades, many studies have proposed efficient models that overcome catastrophic forgetting when learning new data. However, because they focused on providing the best forgetting-reduction performance, studies have moved away from real-life applications, where algorithms need to adapt to changing environments and perform regardless of the type of data arrival. Therefore, there is a growing need to define new scenarios to assess the robustness of existing methods with those challenges in mind. The issue of data availability during training is another essential point in the development of solid continual learning algorithms. Depending on the streaming formulation, in the more extreme scenarios the model needs to adapt to new data as soon as it arrives, without the possibility of reviewing it afterwards. In this study, we review existing continual learning scenarios and their associated terms. These existing terms and definitions are synthesized in an atlas in order to provide a better overview. Based on two of the main categories defined in the atlas, “Class-IL” and “Domain-IL”, we define eight scenarios with data streams of varying complexity that allow us to test model robustness under changing data arrival. We evaluate Dream Net - Data Free, a privacy-preserving continual learning algorithm, in each proposed scenario and demonstrate that this model is robust enough to succeed in every one, regardless of how the data is presented. We also show that it is competitive with continual learning algorithms from the literature that are not privacy-preserving, which is a clear advantage for real-life human-centered applications.
{"title":"A study of the Dream Net model robustness across continual learning scenarios","authors":"M. Mainsant, M. Mermillod, C. Godin, M. Reyboz","doi":"10.1109/ICDMW58026.2022.00111","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00111","url":null,"abstract":"Continual learning is one of the major challenges of deep learning. For decades, many studies have proposed efficient models overcoming catastrophic forgetting when learning new data. However, as they were focused on providing the best reduce-forgetting performance, studies have moved away from real-life applications where algorithms need to adapt to changing environments and perform, no matter the type of data arrival. Therefore, there is a growing need to define new scenarios to assess the robustness of existing methods with those challenges in mind. The issue of data availability during training is another essential point in the development of solid continual learning algorithms. Depending on the streaming formulation, the model needs in the more extreme scenarios to be able to adapt to new data as soon as it arrives and without the possibility to review it afterwards. In this study, we propose a review of existing continual learning scenarios and their associated terms. Those existing terms and definitions are synthesized in an atlas in order to provide a better overview. Based on two of the main categories defined in the atlas, “Class-IL.” and “Domain-IL”, we define eight different scenarios with data streams of varying complexity that allow to test the models robustness in changing data arrival scenarios. We choose to evaluate Dream Net - Data Free, a privacy-preserving continual learning algorithm, in each proposed scenario and demonstrate that this model is robust enough to succeed in every proposed scenario, regardless of how the data is presented. 
We also show that it is competitive with other continual learning literature algorithms that are not privacy preserving which is a clear advantage for real-life human-centered applications.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1513 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129317867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
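The distinction between the two atlas categories can be made concrete with a toy stream generator. The class and domain names are invented; this only illustrates the two scenario families, not the paper's eight concrete scenarios.

```python
# Class-IL: each task introduces previously unseen classes.
# Domain-IL: the label set is fixed, but the input domain shifts per task.
classes = ["cat", "dog", "car", "truck", "ship", "plane"]
domains = ["photo", "sketch", "cartoon"]

class_il_stream = [classes[i:i + 2] for i in range(0, len(classes), 2)]
# each task brings two new classes the model has never seen

domain_il_stream = [(d, classes[:2]) for d in domains]
# same two classes throughout, but drawn from a different domain each task

print(class_il_stream)
print(domain_il_stream)
```

A robustness study like this one then varies how such tasks arrive (ordering, mixing, availability of past data) and checks whether the model still avoids catastrophic forgetting.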
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00110
Baptiste Lafabregue, P. Gançarski, J. Weber, G. Forestier
Automatically extracting knowledge from various datasets is a valuable way to help experts explore new types of data and save time on annotation. This is especially needed for new topics such as emergency management or environmental monitoring. Traditional unsupervised methods often fail to match experts' intuitions or non-formalized knowledge. On the other hand, supervised methods tend to require a lot of knowledge to be efficient. Constrained clustering, a form of semi-supervised learning, mitigates these two effects, as it allows experts to inject their knowledge into the clustering process. However, constraints often have a poor effect on the result because it is hard for experts to give constraints that are both informative and coherent. Based on the idea that it is easier to criticize than to construct, this article presents I-SAMARAH, an incremental constrained clustering method. Through an iterative process, it alternates between a clustering phase, where constraints are incorporated, and a criticize phase, where the expert can give feedback on the clustering. We demonstrate experimentally the efficiency of our method on remote sensing image time series. We compare it to other constrained clustering methods in terms of result quality and to supervised methods in terms of the number of annotations.
{"title":"Incremental constrained clustering with application to remote sensing images time series","authors":"Baptiste Lafabregue, P. Gançarski, J. Weber, G. Forestier","doi":"10.1109/ICDMW58026.2022.00110","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00110","url":null,"abstract":"Automatically extracting knowledge from various datasets is a valuable task to help experts explore new types of data and save time on annotations. This is especially required for new topics such as emergency management or environmental monitoring. Traditional unsupervised methods often tend to not fulfill experts' intuitions or non-formalized knowledge. On the other hand, supervised methods tend to require a lot of knowledge to be efficient. Constrained clustering, a form of semi-supervised methods, mitigates these two effects, as it allows experts to inject their knowledge into the clustering process. However, constraints often have a poor effect on the result because it is hard for experts to give both informative and coherent constraints. Based on the idea that it is easier to criticize than to construct, this article presents a new method, I-SAMARAH, an incremental constrained clustering method. Through an iterative process, it alternates between a clustering phase where constraints are incorporated, and a criticize phase where the expert can give feedback on the clustering. We demonstrate experimentally the efficiency of our method on remote sensing image time series. 
We compare it to other constrained clustering methods in terms of result quality and to supervised methods in terms of number of annotations.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132409988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
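Expert knowledge typically enters constrained clustering as must-link and cannot-link constraints. A minimal violation check, in the style of COP-KMeans rather than I-SAMARAH itself, might look like this (instance indices and constraints are invented):

```python
# Expert-provided pairwise constraints over data instances.
must_link = [(0, 1)]    # instances 0 and 1 belong to the same cluster
cannot_link = [(0, 2)]  # instances 0 and 2 must be separated

def violates(assignment, item, cluster):
    """Would assigning `item` to `cluster` break any pairwise constraint?"""
    for a, b in must_link:
        other = b if item == a else a if item == b else None
        if other is not None and assignment.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if item == a else a if item == b else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

assignment = {0: 0}                 # instance 0 already placed in cluster 0
print(violates(assignment, 1, 1))   # 1 must join 0's cluster
print(violates(assignment, 2, 0))   # 2 cannot join 0's cluster
print(violates(assignment, 2, 1))   # this assignment is allowed
```

An incremental method such as I-SAMARAH revisits the clustering as the expert's critiques add new constraints, instead of restarting from scratch each time.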
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00032
Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt
Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.
{"title":"Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces","authors":"Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt","doi":"10.1109/ICDMW58026.2022.00032","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00032","url":null,"abstract":"Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well-separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering, with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in the identified solutions. In addition, we introduce mutual information as a metric to evaluate whether solutions return consistent clusters across relevant subsets.
To support this novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms, and are thus more suitable to support a decision maker.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116344041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
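The abstract above proposes mutual information as a consistency check between clusterings obtained on different feature subsets, but does not spell out the computation. As an illustration only (function names and the toy labelings are ours, not the authors' implementation), a minimal sketch of mutual information between two cluster labelings:

```python
from collections import Counter
from math import log

def mutual_info(labels_a, labels_b):
    """Mutual information (in nats) between two cluster labelings
    of the same data instances. High values indicate that the two
    labelings group the instances consistently."""
    n = len(labels_a)
    count_a = Counter(labels_a)            # marginal counts, labeling A
    count_b = Counter(labels_b)            # marginal counts, labeling B
    count_ab = Counter(zip(labels_a, labels_b))  # joint counts
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab * log( p_ab / (p_a * p_b) ), with counts folded in
        mi += p_ab * log(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi

# identical 3-cluster labelings across two feature subsets -> maximal MI
consistent = mutual_info([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
# shuffled cluster membership -> lower MI
inconsistent = mutual_info([0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 2])
print(consistent, inconsistent)  # ≈ 1.099 (= ln 3) vs a smaller value
```

In practice one would typically use a normalized variant (e.g. scikit-learn's `normalized_mutual_info_score`) so that scores are comparable across different numbers of clusters.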
Pub Date : 2022-11-01DOI: 10.1109/ICDMW58026.2022.00134
Shahrzad Gholami, Caleb Robinson, Anthony Ortiz, Siyu Yang, J. Margutti, Cameron Birge, R. Dodhia, J. Ferres
The frequency of natural disasters is growing globally. Every year, 350 million people are affected and billions of dollars of damage are incurred. Providing timely and appropriate humanitarian interventions, such as shelters, medical aid, and food, to affected communities is a challenging problem. AI frameworks can support existing efforts to solve these problems in various ways. In this study, we propose using high-resolution satellite imagery from before and after disasters to develop a convolutional neural network model for localizing buildings and scoring their damage level. We categorize damage to buildings into four levels, spanning from not damaged to destroyed, based on the xView2 dataset's scale. Due to the emergency nature of disaster response efforts, the value of automating damage assessment lies primarily in inference speed rather than accuracy. We show that our proposed solution runs three times faster than the fastest xView2 challenge winning solution and over 50 times faster than the slowest first-place solution, which indicates a significant improvement from an operational viewpoint. Our proposed model achieves a pixel-wise F1 score of 0.74 for building localization and a pixel-wise harmonic F1 score of 0.6 for damage classification, and uses a simpler architecture compared to other studies. Additionally, we develop a web-based visualizer that can display the before and after imagery along with the model's building damage predictions on a custom map. This study has been conducted in collaboration with a humanitarian organization as the stakeholder, which plans to deploy and assess the model along with the visualizer for its disaster response efforts in the field.
{"title":"On the Deployment of Post-Disaster Building Damage Assessment Tools using Satellite Imagery: A Deep Learning Approach","authors":"Shahrzad Gholami, Caleb Robinson, Anthony Ortiz, Siyu Yang, J. Margutti, Cameron Birge, R. Dodhia, J. Ferres","doi":"10.1109/ICDMW58026.2022.00134","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00134","url":null,"abstract":"The frequency of natural disasters is growing globally. Every year, 350 million people are affected and billions of dollars of damage are incurred. Providing timely and appropriate humanitarian interventions, such as shelters, medical aid, and food, to affected communities is a challenging problem. AI frameworks can support existing efforts to solve these problems in various ways. In this study, we propose using high-resolution satellite imagery from before and after disasters to develop a convolutional neural network model for localizing buildings and scoring their damage level. We categorize damage to buildings into four levels, spanning from not damaged to destroyed, based on the xView2 dataset's scale. Due to the emergency nature of disaster response efforts, the value of automating damage assessment lies primarily in inference speed rather than accuracy. We show that our proposed solution runs three times faster than the fastest xView2 challenge winning solution and over 50 times faster than the slowest first-place solution, which indicates a significant improvement from an operational viewpoint. Our proposed model achieves a pixel-wise F1 score of 0.74 for building localization and a pixel-wise harmonic F1 score of 0.6 for damage classification, and uses a simpler architecture compared to other studies. Additionally, we develop a web-based visualizer that can display the before and after imagery along with the model's building damage predictions on a custom map.
This study has been conducted in collaboration with a humanitarian organization as the stakeholder, which plans to deploy and assess the model along with the visualizer for its disaster response efforts in the field.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123438009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
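The abstract above reports a pixel-wise F1 score for localization and a pixel-wise harmonic F1 score over the four damage classes, in the style of the xView2 scoring. As an illustration only (function names and the toy label arrays are ours, and the real metric is computed over full image masks), a minimal sketch on flattened per-pixel label arrays:

```python
def per_class_f1(pred, target, cls):
    """Pixel-wise F1 for one class, given flat per-pixel label arrays."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, target))
    fp = sum(p == cls and t != cls for p, t in zip(pred, target))
    fn = sum(p != cls and t == cls for p, t in zip(pred, target))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def harmonic_f1(pred, target, classes):
    """Harmonic mean of per-class F1 scores; zero if any class scores zero,
    so every damage level must be predicted reasonably well."""
    scores = [per_class_f1(pred, target, c) for c in classes]
    if any(s == 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# toy 4-class example: 0 = not damaged ... 3 = destroyed
target = [0, 0, 1, 1, 2, 2, 3, 3]
pred   = [0, 0, 1, 2, 2, 2, 3, 3]
print(harmonic_f1(pred, target, classes=[0, 1, 2, 3]))  # ≈ 0.842
```

The harmonic mean is deliberately harsher than the arithmetic mean: one poorly predicted damage class drags the whole score down, which matters when rare classes like "destroyed" drive the humanitarian response.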