Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00009
Lu Jiang, Y. Li, Na Luo, Jianan Wang, Qiao Ning
With the development of technology and the sharing economy, Airbnb, a well-known short-term rental platform, has become a first choice for many young people. Airbnb pricing has long been a problem worth studying. While previous studies achieve promising results, deficiencies remain: (1) the feature attributes of rentals are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies that predict the rental price using the points of interest (POI) around the house. To address these challenges, we propose a multi-source information embedding (MSIE) model to predict the rental price of Airbnb listings. Specifically, we first select statistical features to embed the original rental data. Secondly, we generate word feature vectors and emotional scores from three different kinds of text information to form the text feature embedding. Thirdly, we use the points of interest (POI) around the rental house to generate a variety of spatial network graphs and learn network embeddings to obtain the spatial feature embedding. Finally, we combine the three modules into a multi-source rental representation and use a fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
{"title":"A Multi-Source Information Learning Framework for Airbnb Price Prediction","authors":"Lu Jiang, Y. Li, Na Luo, Jianan Wang, Qiao Ning","doi":"10.1109/ICDMW58026.2022.00009","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00009","url":null,"abstract":"With the development of technology and sharing economy, Airbnb as a famous short-term rental platform, has become the first choice for many young people to select. The issue of Airbnb's pricing has always been a problem worth studying. While the previous studies achieve promising results, there are exists deficiencies to solve. Such as, (1) the feature attributes of rental are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price combined with the point of interest(POI) around the house. To address the above challenges, we proposes a multi-source information embedding(MSIE) model to predict the rental price of Airbnb. Specifically, we first selects the statistical feature to embed the original rental data. Secondly, we generates the word feature vector and emotional score combination of three different text information to form the text feature embedding. Thirdly, we uses the points of interest(POI) around the rental house information generates a variety of spatial network graphs, and learns the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi source rental representations, and uses the constructed fully connected neural network to predict the price. 
The analysis of the experimental results shows the effectiveness of our proposed model.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130213241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
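The multi-source combination the abstract describes, concatenating statistical, text, and spatial embeddings and feeding them to a fully connected regressor, can be sketched roughly as follows. All dimensions and weights here are invented for illustration; this is not the authors' implementation, and the weights would be learned against listed prices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-listing embeddings from the three modules
# (dimensions are illustrative, not taken from the paper).
stat_emb = rng.normal(size=(4, 8))     # statistical features of 4 listings
text_emb = rng.normal(size=(4, 16))    # word vectors + emotional scores
spatial_emb = rng.normal(size=(4, 8))  # POI spatial-graph embeddings

# Multi-source rental representation: concatenate the three modules.
x = np.concatenate([stat_emb, text_emb, spatial_emb], axis=1)  # shape (4, 32)

def relu(z):
    return np.maximum(z, 0.0)

# One forward pass of a small fully connected regressor
# (untrained random weights, purely to show the data flow).
w1 = rng.normal(size=(32, 16)); b1 = np.zeros(16)
w2 = rng.normal(size=(16, 1));  b2 = np.zeros(1)
price_pred = relu(x @ w1 + b1) @ w2 + b2   # one predicted price per listing
print(price_pred.shape)
```

In a real pipeline the concatenated representation and the regressor weights would be trained end to end on observed rental prices.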
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00053
Huanyi Zhou, Honggang Zhao, Wenlu Wang
Sparse computed tomography (CT) reconstruction can lead to significant streak artifacts. Image restoration that removes these artifacts while recovering image features is an important area of research in low-dose sparse CT imaging. In pre-clinical research, where a lag still exists in the use of professional CT equipment, existing imaging devices provide limited X-ray dose energy accompanied by strong noise patterns when scanning, so reconstructed CT images contain significant noise and artifacts. To address this issue, we propose a deep transfer learning (DTL) neural network training method that exploits open-source data for initial training and a small-scale detected phantom image, with its total variation result, for transfer learning. We hypothesize that a neural network pre-trained on open-source data has no prior knowledge of our device configuration, which prevents its application to our measured data, and that deep transfer learning on a small-scale detected phantom can feed the specific configuration into the model. Our experiment has demonstrated that our proposed method, incorporating a modified total variation (TV) algorithm, can successfully realize a good balance between artifact removal and image feature restoration.
{"title":"U-Net Transfer Learning for Image Restoration on Sparse CT Reconstruction in Pre-Clinical Research","authors":"Huanyi Zhou, Honggang Zhao, Wenlu Wang","doi":"10.1109/ICDMW58026.2022.00053","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00053","url":null,"abstract":"Sparse computed tomography (CT) reconstruction can lead to significant streak artifacts. Image restoration that removes these artifacts while recovering image features is an important area of research in low-dose sparse CT imaging. In pre-clinical research, where a lag still exists in the use of professional CT equipment, existing imaging devices provide limited X-ray dose energy accompanied by strong noise patterns when scanning. Reconstructed CT images contain significant noise and artifacts. We propose a deep transfer learning (DTL) neural network training method that exploits open-source data for initial training and a small-scale detected phantom image with its total variation result for transfer learning to address this issue. We hypothesize that a pre-trained neural network from open-source data has no prior knowledge of our device configuration, which prevents its application on our measured data, and deep transfer learning on small-scale detected phantom can feed specific configurations into the model. 
Our experiment has demonstrated that our proposed method, incorporating a modified total variation (TV) algorithm, can successfully realize a good balance between artifact removal and image feature restoration.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128594862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
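The total variation (TV) regularization the method builds on can be illustrated with a minimal 1-D denoising sketch. The paper applies a modified TV algorithm to 2-D CT slices; the signal, parameters, and smoothed gradient-descent solver below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Piecewise-constant "phantom" signal with additive noise.
clean = np.repeat([0.0, 1.0, 0.0], 40)
noisy = clean + 0.2 * rng.normal(size=clean.size)

def tv_denoise(y, lam=0.3, step=0.02, iters=3000, eps=1e-3):
    """Gradient descent on 0.5*||x - y||^2 + lam * sum sqrt((dx)^2 + eps)."""
    x = y.copy()
    for _ in range(iters):
        dx = np.diff(x)
        g = dx / np.sqrt(dx * dx + eps)   # derivative of smoothed |dx|
        # Chain rule spreads each difference's gradient to its two endpoints.
        tv_grad = np.concatenate([[0.0], g]) - np.concatenate([g, [0.0]])
        x -= step * ((x - y) + lam * tv_grad)
    return x

denoised = tv_denoise(noisy)
# TV penalization suppresses noise oscillations while preserving edges.
print(np.abs(np.diff(noisy)).sum(), np.abs(np.diff(denoised)).sum())
```

The TV term penalizes the total amount of oscillation, which is why it removes streak-like noise while keeping sharp anatomical edges, the balance the paper aims for.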
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00029
Seiji Okura, T. Mohri
Fairness in machine learning is a recently established research area concerned with mitigating the bias of unfair models that treat unprivileged people unfavorably based on protected attributes. We take an approach to mitigating such bias based on the idea of data segmentation, that is, dividing data into segments in which people should be treated similarly. Such an approach should be useful in the sense that the mitigation process itself is explainable for cases in which similar people should be treated similarly. Although research on such cases exists, the question of the effectiveness of data segmentation itself remains to be answered. In this paper, we answer this question by empirically analyzing data-segmentation experiments on two datasets, the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controlled during model training by the way data is divided into segments, more specifically, by selecting the attributes and setting the number of segments to adjust statistics such as the statistical parity of the segments and the mutual information between attributes; (2) the effects of data segmentation depend on the classifier; and (3) there are weak trade-offs between fairness and accuracy with regard to data segmentation.
{"title":"Empirical analysis of fairness-aware data segmentation","authors":"Seiji Okura, T. Mohri","doi":"10.1109/ICDMW58026.2022.00029","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00029","url":null,"abstract":"Fairness in machine learning is a research area that is recently established, for mitigating bias of unfair models that treat unprivileged people unfavorably based on protected attributes. We want to take an approach for mitigating such bias based on the idea of data segmentation, that is, dividing data into segments where people should be treated similarly. Such an approach should be useful in the sense that the mitigation process itself is explainable for cases that similar people should be treated similarly. Although research on such cases exists, the question of effectiveness of data segmentation itself, however, remains to be answered. In this paper, we answer this question by empirically analyzing the experimental results of data segmentation by using two datasets, i.e., the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controllable during training models by the way of dividing data into segments; more specifically, by selecting the attributes and setting the number of segments for adjusting statistics such as statistical parity of the segments and mutual information between the attributes, etc. 
(2) the effects of data segmentation is dependent on classifiers, and (3) there exist weak trade-offs between fairness and accuracy with regard to data segmentation.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
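The segment-level statistic the paper adjusts, statistical parity within each segment, can be computed along these lines. The records, attribute layout, and segmentation are invented; this is only a sketch of the idea, not the paper's pipeline.

```python
from collections import defaultdict

# Hypothetical records: (segment attribute, protected attribute, outcome).
records = [
    ("HS", "F", 0), ("HS", "M", 1), ("HS", "F", 1), ("HS", "M", 1),
    ("BSc", "F", 1), ("BSc", "M", 1), ("BSc", "F", 0), ("BSc", "M", 0),
]

def statistical_parity_by_segment(rows, seg_key, prot_key, label_key):
    """P(favorable | group M) - P(favorable | group F), per segment."""
    seg = defaultdict(lambda: {"M": [0, 0], "F": [0, 0]})  # [favorable, total]
    for r in rows:
        s = seg[r[seg_key]][r[prot_key]]
        s[0] += r[label_key]
        s[1] += 1
    out = {}
    for name, groups in seg.items():
        rate = {g: (fav / tot if tot else 0.0) for g, (fav, tot) in groups.items()}
        out[name] = rate["M"] - rate["F"]
    return out

gaps = statistical_parity_by_segment(records, 0, 1, 2)
print(gaps)  # parity gap per segment; 0.0 means equal favorable rates
```

Choosing which attributes define the segments (and how many segments to use) changes these per-segment gaps, which is the control knob result (1) above refers to.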
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00042
Mahmoud Seifallahi, J. Galvin, B. Ghoraani
Mild cognitive impairment (MCI) is abnormal cognitive decline beyond expected normal decline. The rate of progression to Alzheimer's disease (AD) in people with MCI is an estimated 80% within 6 years. However, distinguishing MCI from normal cognition in older adults remains a clinical challenge in early AD detection. We investigated a new method for detecting MCI based on patients' gait and balance. Our approach performs a comprehensive analysis of the Timed Up and Go (TUG) test, based on the first application of a Kinect v.2 camera to record movement measures, combined with machine learning to differentiate between two groups of older adults: those with MCI and healthy controls (HC). We collected movement data from 25 joints of the body via a Kinect v.2 camera as 30 HC and 25 MCI subjects performed TUG. The collected data provided a comprehensive list of 61 gait and balance features, including the duration of TUG, the duration and velocity of the transition phases, and micro and macro gait features. Our analysis showed that 25 features differed significantly between MCI and HC subjects, of which 20 were unique features as indicated by our correlation analysis. Classification using three different classifiers, support vector machine (SVM), random forest, and artificial neural network, showed that our approach detected MCI subjects best with SVM, at 94% accuracy, 100% precision, 93.33% F-score, and 0.94 AUC. These observations suggest the possibility of our approach as a low-cost, easy-to-use MCI screening tool for objectively detecting subjects at high risk of developing AD. Such a tool is well suited for widespread application in clinical settings and nursing homes to detect early signs of cognitive impairment and promote healthy aging.
{"title":"Detection of Mild Cognitive Impairment from Quantitative Analysis of Timed Up and Go (TUG)","authors":"Mahmoud Seifallahi, J. Galvin, B. Ghoraani","doi":"10.1109/ICDMW58026.2022.00042","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00042","url":null,"abstract":"Mild cognitive impairment (MCl) is abnormal cognitive decline beyond expected normal decline. The rate of progression to Alzheimer's disease (AD) in people with MCl is an estimated 80% in 6 years. However, identifying MCI from normal cognition in older adults remains a clinical challenge in early AD detection. We investigated a new method for detecting MCI based on patients' gait and balance. Our approach performs a comprehensive analysis of the Timed Up and Go test (TUG), based on the first application of a Kinect v.2 camera to record and provide movement measures and machine learning to differentiate between the two groups of older adults with MCI and healthy controls (HC). We collected movement data from 25 joints of the body via a Kinect v.2 camera as 30 HC and 25 MCI subjects performed TUG. The collected data provided a comprehensive list of gait and balance measures with 61 features, including duration of TUG, duration and velocity of transition phases, and micro and macro gait features. Our analysis evidenced that 25 features were significantly different between MCI and HC subjects, where 20 of them were unique features as indicated by our correlation analysis. The classification results using three different classifiers of support vector machine (SVM), random forest, and artificial neural network showed that the ability of our approach for detecting MCI subjects with the highest performance was using SVM with 94% accuracy, 100 % precision, 93.33% F-score, and 0.94 AUC. These observations suggest the possibility of our approach as a low-cost, easy-to-use MCI screening tool for objectively detecting subjects at high risk of developing AD. 
Such a tool is well-suited for widespread application in clinical settings and nursing homes to detect early signs of cognitive impairment and promote healthy aging.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125740847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
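Screening features for significant group differences, as in the 25-feature result above, might look like the following Welch's t-test sketch. The feature values are made up, not the study's measurements, and the paper does not specify which significance test it used.

```python
import math
from statistics import mean, variance

# Illustrative values of a single gait feature (e.g., TUG duration in
# seconds) for the two groups; numbers are invented.
hc  = [8.1, 7.9, 8.4, 8.0, 7.7, 8.2, 8.3, 7.8]
mci = [10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7, 10.0]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

t = welch_t(hc, mci)
print(abs(t))  # a large |t| flags the feature as discriminative
```

Features passing such a screen (and surviving a correlation-based redundancy check, as in the paper) would then feed the SVM, random forest, or neural network classifiers.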
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00024
Jianing Zhao, Ayana Takai, E. Kita
An ensemble model is applied to stock price prediction in this study. The proposed ensemble model is based on a weighted average of the values predicted by the base algorithms: Linear Regression, Long Short-Term Memory (LSTM), Support Vector Regression (SVR), and LightGBM. The performance of the proposed model depends on the weight parameters, which are calculated from past data for the base models. The stock price prediction of Toyota Motor Corporation is used as the numerical example. LSTM, SVR, and LightGBM models are then built to recognize the trend of the weight sequence data and to predict the most suitable combination weights for the ensemble. The experimental results show that every ensemble model achieves significantly better accuracy than each component model. The proposed model also achieved lower error than the simple-average and error-based combination methods. Even a tiny difference in the choice of combination weights can play a crucial role in a linear combination of models for prediction.
{"title":"Weight-Training Ensemble Model for Stock Price Forecast","authors":"Jianing Zhao, Ayana Takai, E. Kita","doi":"10.1109/ICDMW58026.2022.00024","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00024","url":null,"abstract":"The ensemble model is applied for the stock price prediction in this study. The proposed ensemble model is based on the weighted average estimation of the values predicted by base algorithms. The base algorithms include Linear Regression, Long Short-Term Memory (LSTM), Support Vector Regression (SVR) and lightGBM. The performance of the proposed model depends on the weight parameters. The past data are collected to calculate the weigh parameters for base models of the ensemble models. The stock price prediction of Toyota Motor Corporation is considered as the numerical examples. Then LSTM, SVR and LightGBM are built to recognize the trend of the weight sequence data and to predict the most suitable combination weights for ensemble. The experimental results show that any ensemble models achieves significantly better accuracy than each component model. The proposed model also achieved the lowest error than simple average and error-based combination method. Even a tiny difference in choosing associated combining weights can play a crucial role in linear combination of models for prediction.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128503683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00033
Chaekyu Lee, Jaekwang Kim
In this paper, we propose the star topology and rejection method (STARM), a new oversampling technique that generally performs well for varying data and algorithms. STARM is a hybrid technique that combines the advantages of Polynom-fit-SMOTE, LEE, and SMOTE, all of which have yielded high performance based on different technical features, and eliminates their disadvantages. To verify that the proposed technique exhibits high performance in general situations, we conducted 28,028 experiments to compare the predictive performance of 77 oversampling techniques with four machine learning algorithms for 91 imbalanced datasets of various types. Consequently, STARM yielded the highest performance on average among the 77 techniques. In addition, it showed excellent performance even in various algorithms, various imbalanced ratios, and various data volumes.
{"title":"Hybrid Oversampling Technique Based on Star Topology and Rejection Methodology for Classifying Imbalanced Data","authors":"Chaekyu Lee, Jaekwang Kim","doi":"10.1109/ICDMW58026.2022.00033","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00033","url":null,"abstract":"In this paper, we propose the star topology and rejection method (STARM), a new oversampling technique that generally performs well for varying data and algorithms. STARM is a hybrid technique that combines the advantages of Polynom-fit-SMOTE, LEE, and SMOTE, all of which have yielded high performance based on different technical features, and eliminates their disadvantages. To verify that the proposed technique exhibits high performance in general situations, we conducted 28,028 experiments to compare the predictive performance of 77 oversampling techniques with four machine learning algorithms for 91 imbalanced datasets of various types. Consequently, STARM yielded the highest performance on average among the 77 techniques. In addition, it showed excellent performance even in various algorithms, various imbalanced ratios, and various data volumes.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126988768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00111
M. Mainsant, M. Mermillod, C. Godin, M. Reyboz
Continual learning is one of the major challenges of deep learning. For decades, many studies have proposed efficient models that overcome catastrophic forgetting when learning new data. However, because they focused on providing the best forgetting-reduction performance, studies have moved away from real-life applications, where algorithms need to adapt to changing environments and perform regardless of the type of data arrival. Therefore, there is a growing need to define new scenarios to assess the robustness of existing methods with those challenges in mind. The issue of data availability during training is another essential point in the development of solid continual learning algorithms. Depending on the streaming formulation, in the more extreme scenarios the model needs to adapt to new data as soon as it arrives, without the possibility of reviewing it afterwards. In this study, we review existing continual learning scenarios and their associated terms. These existing terms and definitions are synthesized in an atlas in order to provide a better overview. Based on two of the main categories defined in the atlas, “Class-IL” and “Domain-IL”, we define eight scenarios with data streams of varying complexity that allow us to test model robustness under changing data arrival. We evaluate Dream Net - Data Free, a privacy-preserving continual learning algorithm, in each proposed scenario and demonstrate that this model is robust enough to succeed in every one, regardless of how the data is presented. We also show that it is competitive with continual learning algorithms from the literature that are not privacy-preserving, which is a clear advantage for real-life human-centered applications.
{"title":"A study of the Dream Net model robustness across continual learning scenarios","authors":"M. Mainsant, M. Mermillod, C. Godin, M. Reyboz","doi":"10.1109/ICDMW58026.2022.00111","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00111","url":null,"abstract":"Continual learning is one of the major challenges of deep learning. For decades, many studies have proposed efficient models overcoming catastrophic forgetting when learning new data. However, as they were focused on providing the best reduce-forgetting performance, studies have moved away from real-life applications where algorithms need to adapt to changing environments and perform, no matter the type of data arrival. Therefore, there is a growing need to define new scenarios to assess the robustness of existing methods with those challenges in mind. The issue of data availability during training is another essential point in the development of solid continual learning algorithms. Depending on the streaming formulation, the model needs in the more extreme scenarios to be able to adapt to new data as soon as it arrives and without the possibility to review it afterwards. In this study, we propose a review of existing continual learning scenarios and their associated terms. Those existing terms and definitions are synthesized in an atlas in order to provide a better overview. Based on two of the main categories defined in the atlas, “Class-IL.” and “Domain-IL”, we define eight different scenarios with data streams of varying complexity that allow to test the models robustness in changing data arrival scenarios. We choose to evaluate Dream Net - Data Free, a privacy-preserving continual learning algorithm, in each proposed scenario and demonstrate that this model is robust enough to succeed in every proposed scenario, regardless of how the data is presented. 
We also show that it is competitive with other continual learning literature algorithms that are not privacy preserving which is a clear advantage for real-life human-centered applications.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1513 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129317867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
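The distinction between the two atlas categories can be made concrete with a toy stream generator. The class and domain names are invented; this only illustrates the two scenario families, not the paper's eight concrete scenarios.

```python
# Class-IL: each task introduces previously unseen classes.
# Domain-IL: the label set is fixed, but the input domain shifts per task.
classes = ["cat", "dog", "car", "truck", "ship", "plane"]
domains = ["photo", "sketch", "cartoon"]

class_il_stream = [classes[i:i + 2] for i in range(0, len(classes), 2)]
# each task brings two new classes the model has never seen

domain_il_stream = [(d, classes[:2]) for d in domains]
# same two classes throughout, but drawn from a different domain each task

print(class_il_stream)
print(domain_il_stream)
```

A robustness study like this one then varies how such tasks arrive (ordering, mixing, availability of past data) and checks whether the model still avoids catastrophic forgetting.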
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00110
Baptiste Lafabregue, P. Gançarski, J. Weber, G. Forestier
Automatically extracting knowledge from various datasets is a valuable way to help experts explore new types of data and save time on annotation. This is especially needed for new topics such as emergency management or environmental monitoring. Traditional unsupervised methods often fail to match experts' intuitions or non-formalized knowledge. On the other hand, supervised methods tend to require a lot of knowledge to be efficient. Constrained clustering, a form of semi-supervised learning, mitigates these two effects, as it allows experts to inject their knowledge into the clustering process. However, constraints often have a poor effect on the result because it is hard for experts to give constraints that are both informative and coherent. Based on the idea that it is easier to criticize than to construct, this article presents I-SAMARAH, an incremental constrained clustering method. Through an iterative process, it alternates between a clustering phase, where constraints are incorporated, and a criticize phase, where the expert can give feedback on the clustering. We demonstrate experimentally the efficiency of our method on remote sensing image time series. We compare it to other constrained clustering methods in terms of result quality and to supervised methods in terms of the number of annotations.
{"title":"Incremental constrained clustering with application to remote sensing images time series","authors":"Baptiste Lafabregue, P. Gançarski, J. Weber, G. Forestier","doi":"10.1109/ICDMW58026.2022.00110","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00110","url":null,"abstract":"Automatically extracting knowledge from various datasets is a valuable task to help experts explore new types of data and save time on annotations. This is especially required for new topics such as emergency management or environmental monitoring. Traditional unsupervised methods often tend to not fulfill experts' intuitions or non-formalized knowledge. On the other hand, supervised methods tend to require a lot of knowledge to be efficient. Constrained clustering, a form of semi-supervised methods, mitigates these two effects, as it allows experts to inject their knowledge into the clustering process. However, constraints often have a poor effect on the result because it is hard for experts to give both informative and coherent constraints. Based on the idea that it is easier to criticize than to construct, this article presents a new method, I-SAMARAH, an incremental constrained clustering method. Through an iterative process, it alternates between a clustering phase where constraints are incorporated, and a criticize phase where the expert can give feedback on the clustering. We demonstrate experimentally the efficiency of our method on remote sensing image time series. 
We compare it to other constrained clustering methods in terms of result quality and to supervised methods in terms of number of annotations.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132409988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
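Expert knowledge typically enters constrained clustering as must-link and cannot-link constraints. A minimal violation check, in the style of COP-KMeans rather than I-SAMARAH itself, might look like this (instance indices and constraints are invented):

```python
# Expert-provided pairwise constraints over data instances.
must_link = [(0, 1)]    # instances 0 and 1 belong to the same cluster
cannot_link = [(0, 2)]  # instances 0 and 2 must be separated

def violates(assignment, item, cluster):
    """Would assigning `item` to `cluster` break any pairwise constraint?"""
    for a, b in must_link:
        other = b if item == a else a if item == b else None
        if other is not None and assignment.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if item == a else a if item == b else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

assignment = {0: 0}                 # instance 0 already placed in cluster 0
print(violates(assignment, 1, 1))   # 1 must join 0's cluster
print(violates(assignment, 2, 0))   # 2 cannot join 0's cluster
print(violates(assignment, 2, 1))   # this assignment is allowed
```

An incremental method such as I-SAMARAH revisits the clustering as the expert's critiques add new constraints, instead of restarting from scratch each time.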
Pub Date: 2022-11-01 | DOI: 10.1109/ICDMW58026.2022.00032
Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt
Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.
{"title":"Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces","authors":"Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt","doi":"10.1109/ICDMW58026.2022.00032","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00032","url":null,"abstract":"Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well-separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering, with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in the identified solutions. In addition, we introduce mutual information as a metric to evaluate whether solutions return consistent clusters across relevant subsets.
To support this novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms, and are thus more suitable to support a decision maker.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116344041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
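The abstract above proposes mutual information as a consistency check between clusterings obtained on different feature subsets, but does not spell out the computation. As an illustration only (function names and the toy labelings are ours, not the authors' implementation), a minimal sketch of mutual information between two cluster labelings:

```python
from collections import Counter
from math import log

def mutual_info(labels_a, labels_b):
    """Mutual information (in nats) between two cluster labelings
    of the same data instances. High values indicate that the two
    labelings group the instances consistently."""
    n = len(labels_a)
    count_a = Counter(labels_a)            # marginal counts, labeling A
    count_b = Counter(labels_b)            # marginal counts, labeling B
    count_ab = Counter(zip(labels_a, labels_b))  # joint counts
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab * log( p_ab / (p_a * p_b) ), with counts folded in
        mi += p_ab * log(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi

# identical 3-cluster labelings across two feature subsets -> maximal MI
consistent = mutual_info([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
# shuffled cluster membership -> lower MI
inconsistent = mutual_info([0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 2])
print(consistent, inconsistent)  # ≈ 1.099 (= ln 3) vs a smaller value
```

In practice one would typically use a normalized variant (e.g. scikit-learn's `normalized_mutual_info_score`) so that scores are comparable across different numbers of clusters.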
Pub Date : 2022-11-01DOI: 10.1109/ICDMW58026.2022.00134
Shahrzad Gholami, Caleb Robinson, Anthony Ortiz, Siyu Yang, J. Margutti, Cameron Birge, R. Dodhia, J. Ferres
The frequency of natural disasters is growing globally. Every year, 350 million people are affected and billions of dollars of damage are incurred. Providing timely and appropriate humanitarian interventions, such as shelters, medical aid, and food, to affected communities is a challenging problem. AI frameworks can support existing efforts to solve these problems in various ways. In this study, we propose using high-resolution satellite imagery from before and after disasters to develop a convolutional neural network model for localizing buildings and scoring their damage level. We categorize damage to buildings into four levels, spanning from not damaged to destroyed, based on the xView2 dataset's scale. Due to the emergency nature of disaster response efforts, the value of automating damage assessment lies primarily in inference speed rather than accuracy. We show that our proposed solution runs three times faster than the fastest xView2 challenge winning solution and over 50 times faster than the slowest first-place solution, which indicates a significant improvement from an operational viewpoint. Our proposed model achieves a pixel-wise F1 score of 0.74 for building localization and a pixel-wise harmonic F1 score of 0.6 for damage classification, and uses a simpler architecture compared to other studies. Additionally, we develop a web-based visualizer that can display the before and after imagery along with the model's building damage predictions on a custom map. This study has been conducted in collaboration with a humanitarian organization as the stakeholder, which plans to deploy and assess the model along with the visualizer for its disaster response efforts in the field.
{"title":"On the Deployment of Post-Disaster Building Damage Assessment Tools using Satellite Imagery: A Deep Learning Approach","authors":"Shahrzad Gholami, Caleb Robinson, Anthony Ortiz, Siyu Yang, J. Margutti, Cameron Birge, R. Dodhia, J. Ferres","doi":"10.1109/ICDMW58026.2022.00134","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00134","url":null,"abstract":"The frequency of natural disasters is growing globally. Every year, 350 million people are affected and billions of dollars of damage are incurred. Providing timely and appropriate humanitarian interventions, such as shelters, medical aid, and food, to affected communities is a challenging problem. AI frameworks can support existing efforts to solve these problems in various ways. In this study, we propose using high-resolution satellite imagery from before and after disasters to develop a convolutional neural network model for localizing buildings and scoring their damage level. We categorize damage to buildings into four levels, spanning from not damaged to destroyed, based on the xView2 dataset's scale. Due to the emergency nature of disaster response efforts, the value of automating damage assessment lies primarily in inference speed rather than accuracy. We show that our proposed solution runs three times faster than the fastest xView2 challenge winning solution and over 50 times faster than the slowest first-place solution, which indicates a significant improvement from an operational viewpoint. Our proposed model achieves a pixel-wise F1 score of 0.74 for building localization and a pixel-wise harmonic F1 score of 0.6 for damage classification, and uses a simpler architecture compared to other studies. Additionally, we develop a web-based visualizer that can display the before and after imagery along with the model's building damage predictions on a custom map.
This study has been conducted in collaboration with a humanitarian organization as the stakeholder, which plans to deploy and assess the model along with the visualizer for its disaster response efforts in the field.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123438009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
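The abstract above reports a pixel-wise F1 score for localization and a pixel-wise harmonic F1 score over the four damage classes, in the style of the xView2 scoring. As an illustration only (function names and the toy label arrays are ours, and the real metric is computed over full image masks), a minimal sketch on flattened per-pixel label arrays:

```python
def per_class_f1(pred, target, cls):
    """Pixel-wise F1 for one class, given flat per-pixel label arrays."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, target))
    fp = sum(p == cls and t != cls for p, t in zip(pred, target))
    fn = sum(p != cls and t == cls for p, t in zip(pred, target))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def harmonic_f1(pred, target, classes):
    """Harmonic mean of per-class F1 scores; zero if any class scores zero,
    so every damage level must be predicted reasonably well."""
    scores = [per_class_f1(pred, target, c) for c in classes]
    if any(s == 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# toy 4-class example: 0 = not damaged ... 3 = destroyed
target = [0, 0, 1, 1, 2, 2, 3, 3]
pred   = [0, 0, 1, 2, 2, 2, 3, 3]
print(harmonic_f1(pred, target, classes=[0, 1, 2, 3]))  # ≈ 0.842
```

The harmonic mean is deliberately harsher than the arithmetic mean: one poorly predicted damage class drags the whole score down, which matters when rare classes like "destroyed" drive the humanitarian response.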