
International Journal of Data Warehousing and Mining: Latest Publications

Spatiotemporal Data Prediction Model Based on a Multi-Layer Attention Mechanism

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-16 | DOI: 10.4018/ijdwm.315822
Man Jiang, Qilong Han, Haitao Zhang, Hexiang Liu
Spatiotemporal data prediction is of great significance in fields such as smart cities and smart manufacturing. Current spatiotemporal prediction models rely heavily on traditional spatial views or a single temporal granularity, and therefore miss knowledge such as dynamic spatial correlations, periodicity, and mutability. This paper addresses these challenges by proposing a multi-layer attention-based predictive model. The key idea is to use a multi-layer attention mechanism to model the dynamic spatial correlation of different features; multi-granularity historical features are then fused to predict future spatiotemporal data. Experiments on real-world data show that the proposed model outperforms six state-of-the-art benchmark methods.
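The multi-layer attention idea described above can be pictured with a small sketch: one attention layer models spatial correlations among nodes within each temporal granularity, and a second attention layer fuses the per-granularity summaries. The PyTorch module below is a minimal illustration under assumed shapes and layer choices, not the authors' architecture.

```python
# Minimal sketch (assumed architecture, not the paper's implementation):
# attention over spatial nodes per temporal granularity, then attention-based
# fusion of the per-granularity summaries to predict the next value.
import torch
import torch.nn as nn


class MultiLayerAttentionPredictor(nn.Module):
    def __init__(self, n_features: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        # Layer 1: dynamic spatial correlation among nodes (per granularity).
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Layer 2: fusion across temporal granularities (e.g., hour/day/week views).
        self.granularity_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: one tensor per granularity, each (batch, n_nodes, n_features)
        summaries = []
        for x in views:
            h = self.embed(x)                              # (B, N, d)
            h, _ = self.spatial_attn(h, h, h)              # spatial correlations
            summaries.append(h.mean(dim=1, keepdim=True))  # (B, 1, d) per granularity
        g = torch.cat(summaries, dim=1)                    # (B, n_granularities, d)
        g, _ = self.granularity_attn(g, g, g)              # fuse granularities
        return self.head(g.mean(dim=1))                    # (B, 1) prediction


# Toy usage: 3 granularities, 10 spatial nodes, 4 features each.
model = MultiLayerAttentionPredictor(n_features=4)
views = [torch.randn(8, 10, 4) for _ in range(3)]
print(model(views).shape)  # torch.Size([8, 1])
```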
Citations: 0
A New Outlier Detection Algorithm Based on Fast Density Peak Clustering Outlier Factor

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316534
Zhongping Zhang, Sen Li, Weixiong Liu, Y. Wang, Daisy Xin Li
Outlier detection is an important field in data mining, with applications in fraud detection, fault detection, and other areas. This article targets two shortcomings of the density peak clustering algorithm: it requires manual parameter setting and its time complexity is high. First, a k-nearest-neighbors estimate replaces the density peak algorithm's density estimate, using a KD-Tree index structure to compute the k nearest neighbors of each data object. Cluster centers are then selected automatically by the product of density and distance. In addition, the central relative distance and the fast density peak clustering outlier factor are defined to characterize how outlying each data object is, and an outlier detection algorithm is devised based on this factor. Experiments on artificial and real data sets validate the algorithm, confirming its effectiveness and time efficiency against several conventional and recent algorithms.
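To make the ingredients concrete, the sketch below estimates density with k nearest neighbors queried from a KD-Tree and scores each point by its distance to the nearest denser point relative to its density. The score definition and parameters are assumptions for illustration, not the paper's exact outlier factor.

```python
# Illustrative sketch (assumed formulas, not the paper's exact definitions):
# kNN density estimated with a KD-Tree, "delta" = distance to the nearest
# denser point, and a simple outlier score combining the two.
import numpy as np
from sklearn.neighbors import KDTree

def knn_density_outlier_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    tree = KDTree(X)
    # Distances to the k nearest neighbors (first column is the point itself).
    dist, _ = tree.query(X, k=k + 1)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # kNN density estimate

    n = len(X)
    delta = np.zeros(n)
    order = np.argsort(-density)                          # densest first
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = pairwise[i].max()                  # densest point: max distance
        else:
            denser = order[:rank]
            delta[i] = pairwise[i, denser].min()          # nearest denser point

    # Low density and large distance to any denser point -> more outlying.
    return delta / (density + 1e-12)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # one obvious outlier
scores = knn_density_outlier_scores(X)
print(np.argmax(scores))  # expected: 200 (the injected outlier)
```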
Citations: 1
Estimating the Number of Clusters in High-Dimensional Large Datasets

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316142
Xutong Zhu, Lingli Li
Clustering is a basic primitive of exploratory analysis. To obtain valuable results, the number of clusters, a key parameter of most clustering algorithms, must be set appropriately. Existing methods for determining the number of clusters perform well on small low-dimensional datasets, but effectively determining the optimal number of clusters on large high-dimensional datasets remains a challenging problem. In this paper, the authors design a method for estimating the optimal number of clusters on large-scale high-dimensional datasets that overcomes the shortcomings of existing estimation methods and produces accurate estimates quickly. Extensive experiments show that it (1) outperforms existing estimation methods in accuracy and efficiency, (2) generalizes across different datasets, and (3) is suitable for high-dimensional large datasets.
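For context on the task itself, a common baseline sweeps candidate values of k on a sample and keeps the one with the best internal validity score. The sketch below does this with MiniBatchKMeans and the silhouette score; it is only a generic baseline, not the method proposed in the paper.

```python
# Generic baseline for choosing the number of clusters (not the paper's method):
# sweep k, cluster a sample with MiniBatchKMeans, keep the best silhouette score.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def estimate_k(X: np.ndarray, k_max: int = 10, sample_size: int = 2000, seed: int = 0) -> int:
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    best_k, best_score = 2, -1.0
    for k in range(2, k_max + 1):
        labels = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed).fit_predict(sample)
        score = silhouette_score(sample, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Synthetic high-dimensional data with 5 well-separated clusters.
X, _ = make_blobs(n_samples=20000, centers=5, n_features=50, random_state=1)
print(estimate_k(X))  # expected: 5
```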
Citations: 0
An Ensemble Approach for Prediction of Cardiovascular Disease Using Meta Classifier Boosting Algorithms

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316145
Sibo Prasad Patro, Neelamadhab Padhy, Rahul Deo Sah
Few studies have investigated the potential of hybrid ensemble machine learning techniques for building a model to detect and predict heart disease. In this research, the authors address a classification problem by hybridizing a fusion-based ensemble model with machine learning approaches, which produces a more trustworthy ensemble than the original ensemble model and outperforms previous heart disease prediction models. The proposed model is evaluated on the Cleveland heart disease dataset using six boosting techniques: XGBoost, AdaBoost, Gradient Boosting, LightGBM, CatBoost, and Histogram-Based Gradient Boosting. Hybridization produces superior results among the classification algorithms considered: the Meta-XGBoost classifier achieves accuracies of 96.51% for training and 93.37% for testing.
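A stacked ensemble with a boosting meta-classifier can be assembled along the following lines. The sketch uses scikit-learn boosters and a placeholder dataset as stand-ins for the paper's XGBoost/LightGBM/CatBoost components and the Cleveland data, and makes no claim to reproduce the reported accuracies.

```python
# Sketch of a stacked ensemble of boosting classifiers (stand-in components;
# the paper uses XGBoost/LightGBM/CatBoost and the Cleveland dataset).
from sklearn.datasets import load_breast_cancer          # placeholder tabular dataset
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              HistGradientBoostingClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

base_learners = [
    ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("hgb", HistGradientBoostingClassifier(random_state=42)),
]
# The meta-classifier is itself a boosting model, mirroring the "Meta-XGBoost" idea.
meta = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(random_state=42),
    cv=5,
)
meta.fit(X_tr, y_tr)
print(f"test accuracy: {meta.score(X_te, y_te):.3f}")
```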
Citations: 0
An Efficient Association Rule Mining-Based Spatial Keyword Index

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316161
Lianyin Jia, Haotian Tang, Mengjuan Li, Bingxin Zhao, S. Wei, Haihe Zhou
Spatial keyword query has attracted the attention of many researchers. Most existing spatial keyword indexes do not consider differences in keyword distribution, so their efficiency suffers when data are skewed. To this end, this paper proposes a novel association rule mining-based spatial keyword index, ARM-SQ, whose inverted lists are materialized from the frequent item sets mined by association rules; intersections of long lists can thus be avoided. To prevent excessive space costs caused by materialization, a depth-based materialization strategy is introduced, which maintains a good balance between query and space costs. To select the right frequent item sets for answering a query, the authors further implement a benefit-based greedy frequent item set selection algorithm, BGF-Selection. The experimental results show that this algorithm significantly outperforms existing algorithms, and its efficiency can be an order of magnitude higher than SFC-Quad.
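The core trick of materializing inverted lists for frequent keyword sets, so that long-list intersections can be skipped, is easy to sketch. The toy example below mines frequent keyword pairs by support counting and answers a query from a materialized list plus a few short intersections; it does not reproduce ARM-SQ's depth-based materialization or BGF-Selection.

```python
# Toy sketch of materializing inverted lists for frequent keyword sets so that
# long-list intersections can be skipped at query time (illustrative only).
from collections import defaultdict
from itertools import combinations

docs = {
    1: {"coffee", "wifi", "quiet"},
    2: {"coffee", "wifi"},
    3: {"coffee", "wifi", "parking"},
    4: {"coffee", "quiet"},
    5: {"wifi", "parking"},
}

# Plain single-keyword inverted lists.
inverted = defaultdict(set)
for doc_id, kws in docs.items():
    for kw in kws:
        inverted[kw].add(doc_id)

# "Mine" frequent keyword pairs (support >= 3) and materialize their lists.
min_support = 3
materialized = {}
for a, b in combinations(sorted(inverted), 2):
    hits = inverted[a] & inverted[b]
    if len(hits) >= min_support:
        materialized[frozenset((a, b))] = hits

def query(keywords: set[str]) -> set[int]:
    # Prefer a materialized list covering part of the query, then intersect the rest.
    for itemset, hits in materialized.items():
        if itemset <= keywords:
            result, remaining = set(hits), keywords - itemset
            break
    else:
        result, remaining = set(docs), keywords
    for kw in remaining:
        result &= inverted[kw]
    return result

print(sorted(query({"coffee", "wifi", "quiet"})))  # [1]
```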
Citations: 1
RETAD: Vehicle Trajectory Anomaly Detection Based on Reconstruction Error

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316460
Chaoneng Li, Guanwen Feng, Yiran Jia, Yunan Li, Jian Ji, Qiguang Miao
Due to the rapid advancement of wireless sensor and location technologies, a large amount of mobile agent trajectory data has become available. Intelligent city systems and video surveillance both benefit from trajectory anomaly detection. The authors propose an unsupervised reconstruction error-based trajectory anomaly detection (RETAD) method for vehicles to address the issues of conventional anomaly detection: features are difficult to extract, models are susceptible to overfitting, and detection performance is poor. RETAD reconstructs the original vehicle trajectories with an autoencoder based on recurrent neural networks. The model learns the moving patterns of normal trajectories by minimizing the gap between the reconstruction results and the initial inputs, and trajectories whose reconstruction error exceeds an anomaly threshold are flagged as anomalous. Experimental results demonstrate that RETAD detects anomalies more effectively than traditional distance-based, density-based, and machine learning classification algorithms on multiple metrics.
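The reconstruction-error idea can be sketched with a small recurrent autoencoder: encode each trajectory, decode it back, and flag trajectories whose mean squared reconstruction error exceeds a threshold. The module below is a generic illustration with assumed dimensions and an assumed threshold rule, not the RETAD architecture or training setup.

```python
# Minimal recurrent-autoencoder sketch for reconstruction-error anomaly scoring
# (a generic illustration of the idea, not the RETAD model itself).
import torch
import torch.nn as nn


class TrajectoryAutoencoder(nn.Module):
    def __init__(self, n_features: int = 2, hidden: int = 32):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features), e.g. sequences of (lon, lat) points.
        _, h = self.encoder(x)                          # h: (1, batch, hidden)
        z = h.transpose(0, 1).repeat(1, x.size(1), 1)   # repeat code over time steps
        dec, _ = self.decoder(z)
        return self.out(dec)                            # reconstructed trajectory


def reconstruction_errors(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        recon = model(x)
    return ((recon - x) ** 2).mean(dim=(1, 2))          # one error per trajectory


model = TrajectoryAutoencoder()
batch = torch.randn(16, 50, 2)                          # 16 trajectories of 50 points
errors = reconstruction_errors(model, batch)
threshold = errors.mean() + 3 * errors.std()            # simple anomaly threshold
print((errors > threshold).nonzero().flatten())         # indices flagged as anomalous
```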
Citations: 0
Personal Health and Illness Management and the Future Vision of Biomedical Clothing Based on WSN

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-01 | DOI: 10.4018/ijdwm.316126
Ge Zhang, Zubin Ning
It is essential to have a fast, reliable, and energy-efficient connection between wireless sensor networks (WSNs). Control specifications, networking layers, media access control, and physical layers should be optimised or co-designed. Health insurance will become more expensive for individuals with lower incomes. There are privacy and cyber security issues, an increased risk of malpractice lawsuits, and more costs in terms of both time and money for doctors and patients. In this paper, personal health biomedical clothing based on wireless sensor networks (PH-BC-WSN) was used to enhance access to quality health care, boost food production through precision agriculture, and improve the quality of human resources. The internet of things enables the creation of more efficient healthcare and medical asset monitoring systems. The paper also discusses eavesdropping on and manipulation of medical data, fabrication of warnings, denial of service, location tracking of users, physical interference with devices, and electromagnetic attacks.
Citations: 0
Loan Default Prediction Based on Convolutional Neural Network and LightGBM

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2023-01-01 | DOI: 10.4018/ijdwm.315823
Q. Zhu, Wenhao Ding, Mingsen Xiang, M. Hu, Ning Zhang
As consumption patterns change, credit consumption has gradually become a new trend, and frequent loan defaults have drawn increasing attention to default prediction. This paper proposes a new comprehensive method for predicting loan defaults that combines a convolutional neural network with the LightGBM algorithm. First, the strong feature extraction ability of the convolutional neural network is used to extract features from the original loan data and generate a new feature matrix. Second, the new feature matrix is used as input data, and the parameters of the LightGBM algorithm are tuned by grid search to build the LightGBM model. Finally, the LightGBM model is trained on the new feature matrix, yielding the CNN-LightGBM loan default prediction model. To verify the effectiveness and superiority of the model, a series of experiments compared it with four classical models; the results show that the CNN-LightGBM model is superior on all evaluation indexes.
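The two-stage pipeline, CNN features first and booster second, can be sketched as follows. The snippet uses a tiny untrained 1D CNN and scikit-learn's HistGradientBoostingClassifier as a stand-in for LightGBM on synthetic data; the shapes, layers, and substitution are all assumptions, and in practice the CNN would be trained and the booster tuned by grid search as the abstract describes.

```python
# Sketch of the CNN-then-booster pipeline: a small 1D CNN produces a feature
# matrix that is then fed to a gradient-boosted classifier (LightGBM stand-in).
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

class CNNFeatureExtractor(nn.Module):
    def __init__(self, out_features: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(4),
            nn.Flatten(),
            nn.Linear(8 * 4, out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.unsqueeze(1))   # (batch, n_cols) -> (batch, out_features)

# Synthetic stand-in for loan application records.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

extractor = CNNFeatureExtractor()
with torch.no_grad():   # the sketch skips CNN training for brevity
    F_tr = extractor(torch.tensor(X_tr, dtype=torch.float32)).numpy()
    F_te = extractor(torch.tensor(X_te, dtype=torch.float32)).numpy()

booster = HistGradientBoostingClassifier(random_state=0)   # LightGBM stand-in
booster.fit(F_tr, y_tr)
print(f"test accuracy: {booster.score(F_te, y_te):.3f}")
```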
Citations: 2
Top-K Pseudo Labeling for Semi-Supervised Image Classification

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2022-12-30 | DOI: 10.4018/ijdwm.316150
Yi Jiang, Hui Sun
In this paper, a top-k pseudo labeling method for semi-supervised self-learning is proposed. Pseudo labeling is a key technology in semi-supervised self-learning: the quality of the generated pseudo labels largely determines the convergence of the neural network and the accuracy obtained. The authors use top-k pseudo labeling to generate pseudo labels during the training of a semi-supervised neural network model. The proposed labeling method substantially helps in learning features from unlabeled data, is easy to implement, and relies only on the neural network prediction and the hyper-parameter k. Experimental results show that the proposed method works well for semi-supervised learning on the CIFAR-10 and CIFAR-100 datasets. A variant of top-k labeling for supervised learning, named top-k regulation, is also proposed; experiments show that various models achieve higher test-set accuracy when trained with top-k regulation.
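One plausible reading of top-k pseudo labeling, kept deliberately simple here, is to take the k most probable classes predicted for an unlabeled sample as candidate labels and train the model to concentrate probability mass on that candidate set. The helper functions below sketch that formulation; the exact loss and selection rule in the paper may differ.

```python
# One plausible formulation of top-k pseudo labels (an assumption, not the
# paper's exact loss): treat the k most probable classes as candidates and
# maximize the probability mass the model assigns to that candidate set.
import torch
import torch.nn.functional as F

def topk_pseudo_labels(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    # logits: (batch, n_classes) from the current model on unlabeled data.
    return logits.topk(k, dim=1).indices                        # (batch, k) candidates

def topk_pseudo_loss(logits: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    probs = F.softmax(logits, dim=1)
    candidate_mass = probs.gather(1, candidates).sum(dim=1)     # P(label in top-k)
    return -torch.log(candidate_mass + 1e-12).mean()

# Toy usage: pseudo labels come from a detached copy of the model's predictions.
unlabeled_logits = torch.randn(32, 10)
candidates = topk_pseudo_labels(unlabeled_logits.detach(), k=3)
loss = topk_pseudo_loss(unlabeled_logits, candidates)
print(candidates.shape, float(loss))   # torch.Size([32, 3]) and a scalar loss
```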
Citations: 0
A Method for Generating Comparison Tables From the Semantic Web

IF 1.2 | CAS Zone 4, Computer Science | Q3 Computer Science | Pub Date: 2022-04-01 | DOI: 10.4018/ijdwm.298008
A. Giacometti, Béatrice Bouchou-Markhoff, Arnaud Soulet
This paper presents Versus, the first automatic method for generating comparison tables from knowledge bases of the Semantic Web. For this purpose, it introduces the contextual reference level to evaluate whether a feature is relevant for comparing a set of entities. This measure relies on contexts, which are sets of entities similar to the compared entities. Its principle is to favor features whose values for the compared entities are reference (i.e., frequent) in these contexts. The proposal efficiently evaluates the contextual reference level from a public SPARQL endpoint limited by a fair-use policy. Using a new benchmark based on Wikidata, the experiments show the value of the contextual reference level for identifying the features deemed relevant by users, with high precision and recall. In addition, the proposed optimizations significantly reduce the number of required queries for properties as well as for inverse relations. Interestingly, this experimental study also shows that inverse relations bring out a large number of numerical comparison features.
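As a rough illustration of the intuition, the snippet below scores a candidate comparison feature by how frequent the compared entities' values for it are within a context of similar entities, so that widely shared ("reference") values rank above mostly unique ones. This is an assumed simplification; the paper's contextual reference level and its SPARQL-based evaluation are not reproduced.

```python
# Toy sketch: rank comparison features by how frequent the compared entities'
# values are within a context of similar entities (illustrative reading of the
# "contextual reference level", not the paper's exact measure).
from collections import Counter

# Feature -> value for the two compared entities (e.g., programming languages).
entities = {
    "Python": {"paradigm": "multi-paradigm", "typing": "dynamic", "year": 1991},
    "Java":   {"paradigm": "object-oriented", "typing": "static", "year": 1995},
}
context = [  # entities similar to the compared ones
    {"paradigm": "multi-paradigm", "typing": "dynamic", "year": 1995},
    {"paradigm": "object-oriented", "typing": "static", "year": 1983},
    {"paradigm": "multi-paradigm", "typing": "static", "year": 2009},
    {"paradigm": "functional", "typing": "static", "year": 1990},
]

def reference_level(feature: str) -> float:
    counts = Counter(e[feature] for e in context if feature in e)
    total = sum(counts.values())
    # Average relative frequency, in the context, of the compared entities' values.
    freqs = [counts[ent[feature]] / total for ent in entities.values() if feature in ent]
    return sum(freqs) / len(freqs) if freqs else 0.0

for feature in ["paradigm", "typing", "year"]:
    print(feature, round(reference_level(feature), 2))
# Widely shared values like "multi-paradigm" or "static" score higher than
# mostly-unique values like "year", so those features rank as better comparisons.
```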
Citations: 0