
International Journal of Data Warehousing and Mining: Latest Publications

Fusing Syntax and Semantics-Based Graph Convolutional Network for Aspect-Based Sentiment Analysis
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-03-17 | DOI: 10.4018/ijdwm.319803
Jinhui Feng, Shaohua Cai, Kuntao Li, Yifan Chen, Qianhua Cai, Hongya Zhao
Aspect-based sentiment analysis (ABSA) aims to classify the sentiment polarity of a given aspect in a sentence or document and is a fine-grained task in natural language processing. Recent ABSA methods mainly focus on exploiting syntactic information, semantic information, or both. Research on cognition theory reveals that syntax and semantics affect each other. In this work, a graph convolutional network-based model that fuses syntactic and semantic information in line with this cognitive practice is proposed. To start with, a GCN is used to extract syntactic information from the syntax dependency tree. Then, a semantic graph is constructed via a multi-head self-attention mechanism and encoded by a GCN. Furthermore, a parameter-sharing GCN is developed to capture the information common to the semantics and the syntax. Experiments conducted on three benchmark datasets (Laptop14, Restaurant14, and Twitter) validate that the proposed model achieves compelling performance compared with state-of-the-art models.
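As an illustration of how such a fusion can be wired up, the following is a minimal PyTorch sketch, not the authors' implementation: one GCN branch runs over a dependency-tree adjacency, a second over an adjacency derived from multi-head self-attention, and a parameter-shared GCN is applied to both graphs before the aspect representation is pooled and classified. The dimensions, the toy inputs, and fusion by simple summation are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(norm(A + I) H W)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, h, adj):
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        adj = adj / adj.sum(-1, keepdim=True).clamp(min=1e-6)    # row-normalise
        return F.relu(self.lin(adj @ h))

class SynSemGCN(nn.Module):
    def __init__(self, dim, heads=4, num_classes=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.syn_gcn = GCNLayer(dim)      # syntax branch (dependency tree)
        self.sem_gcn = GCNLayer(dim)      # semantics branch (attention graph)
        self.shared_gcn = GCNLayer(dim)   # parameter-sharing branch over both graphs
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, h, dep_adj, aspect_mask):
        _, sem_adj = self.attn(h, h, h)                  # (B, L, L) semantic graph
        fused = (self.syn_gcn(h, dep_adj) + self.sem_gcn(h, sem_adj)
                 + self.shared_gcn(h, dep_adj) + self.shared_gcn(h, sem_adj))
        pooled = (fused * aspect_mask.unsqueeze(-1)).sum(1) / aspect_mask.sum(1, keepdim=True)
        return self.cls(pooled)                          # sentiment logits for the aspect

# toy usage: 2 sentences, 6 tokens each, 32-dim encoder states, aspect at position 2
h = torch.randn(2, 6, 32)
dep_adj = torch.randint(0, 2, (2, 6, 6)).float()
mask = torch.zeros(2, 6)
mask[:, 2] = 1.0
print(SynSemGCN(32)(h, dep_adj, mask).shape)             # torch.Size([2, 3])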
Citations: 0
CTNRL: A Novel Network Representation Learning With Three Feature Integrations
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-03-03 | DOI: 10.4018/ijdwm.318696
Yanlong Tang, Zhonglin Ye, Haixing Zhao, Yi Ji
Network representation learning is one of the key tasks in analyzing network information. Its purpose is to learn a vector for each node in the network and map the nodes into a vector space whose dimensionality is much smaller than the number of nodes. Most current work considers only local structural features and ignores other features of the network, such as attribute features. Aiming at such problems, this paper proposes a novel mechanism that combines network topology with node text information and node clustering information, modeling both on the basis of the network structure and then constraining the learning process of the network representation to obtain optimal node vectors. The method is experimentally verified on three datasets: Citeseer (M10), DBLP (V4), and SDBLP. Experimental results show that the proposed method outperforms algorithms based on network topology and text features alone, verifying the feasibility of the algorithm and achieving the expected results.
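The abstract does not give implementation details, so the following is only a hedged sketch of the general idea of fusing topology, node text, and clustering information into one representation: it stacks a normalized structural context matrix, TF-IDF text features, and a cluster-membership indicator, then compresses them with truncated SVD. The fusion by concatenation, the use of scikit-learn, and the toy graph are assumptions, not the CTNRL algorithm itself.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def embed_nodes(adj, docs, dim=2, n_clusters=2):
    A = normalize(adj + np.eye(len(adj)), norm="l1")      # structural context (1-step)
    S = A + A @ A                                         # add 2-step neighbourhoods
    T = TfidfVectorizer().fit_transform(docs).toarray()   # node text features
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(T)
    C = np.eye(n_clusters)[labels]                        # node clustering information
    X = np.hstack([S, T, C])                              # fuse the three feature views
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(X)

adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
docs = ["graph embedding survey", "graph neural network",
        "text mining topic model", "topic model clustering"]
print(embed_nodes(adj, docs).shape)                       # (4, 2)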
Citations: 0
Research on Rumor Detection Based on a Graph Attention Network With Temporal Features
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-03-02 | DOI: 10.4018/ijdwm.319342
Xiaohui Yang, Hailong Ma, Miao Wang
The higher-order and temporal characteristics of tweet sequences are often ignored in the field of rumor detection. In this paper, a new rumor detection method (T-BiGAT) is proposed to capture the temporal features between tweets by combining a graph attention network (GAT) and a gated recurrent unit (GRU) network. First, timestamps are calculated for each tweet within the same event. For tweets sharing a timestamp, two different propagation subgraphs are constructed according to the response relationships between tweets. Then, a GRU captures intra-layer dependencies between sibling nodes in the subtree, and global features of each subtree are extracted using an improved GAT. Furthermore, the GRU is reused to capture the temporal dependencies of the individual subgraphs at different timestamps. Finally, weights are assigned to the global feature vectors of the subtrees at different timestamps for aggregation, and a mapping function classifies the aggregated vectors.
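A hedged sketch of the general GAT-plus-GRU pattern described above is shown below (PyTorch; not the T-BiGAT implementation): a single-head GAT encodes the propagation snapshot at each timestamp, the mean-pooled readouts form a sequence, and a GRU summarizes that sequence for a rumor/non-rumor classifier. Snapshot construction, the pooling, and all sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention restricted to edges present in adj."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):
        z = self.w(h)                                            # (N, d)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pairs)).squeeze(-1)         # (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))     # keep graph edges only
        return F.elu(torch.softmax(scores, dim=-1) @ z)

class TemporalRumorNet(nn.Module):
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.gat = GATLayer(dim)
        self.gru = nn.GRU(dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, 2)

    def forward(self, snapshots):
        # snapshots: list of (node_features, adjacency), one pair per timestamp
        readouts = [self.gat(h, adj).mean(0) for h, adj in snapshots]
        _, last = self.gru(torch.stack(readouts).unsqueeze(0))   # sequence over time
        return self.cls(last.squeeze(0))                         # rumor / non-rumor logits

dim = 16
snapshots = []
for n in (3, 5, 8):                                   # tweets accumulated at 3 timestamps
    adj = (torch.rand(n, n) > 0.5).float() + torch.eye(n)        # include self-loops
    snapshots.append((torch.randn(n, dim), adj))
print(TemporalRumorNet(dim)(snapshots))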
Citations: 0
Clustering of COVID-19 Multi-Time Series-Based K-Means and PCA With Forecasting
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-02-03 | DOI: 10.4018/ijdwm.317374
Sundus Naji Alaziz, Bakr Albayati, A. A. El-Bagoury, Wasswa Shafik
The COVID-19 pandemic is one of the current universal threats to humanity, and the entire world is cooperating persistently to find ways to decrease its effect. Time series are one of the basic ingredients in developing an accurate prediction model for future estimates of the expansion of this infectious virus. The authors discuss in this paper the goals of the study, the problems, the definitions, and previous studies. They also deal with the theoretical aspects of multi-time-series clustering using both K-means and time series clustering. Finally, ARIMA is used to build a prototype that gives specific predictions of the impact of the COVID-19 pandemic 90 to 140 days ahead. The modeling and prediction process uses the dataset available from the Saudi Ministry of Health for Riyadh, Jeddah, Makkah, and Dammam during the previous four months, and the model is evaluated using a Python program. Based on the proposed method, the authors draw their conclusions.
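A minimal, hedged sketch of the K-means/PCA/ARIMA pipeline described above, using scikit-learn and statsmodels on synthetic stand-in series (the Ministry of Health data are not reproduced here); the number of clusters, the ARIMA order (2, 1, 2), and the toy series are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(120)                                   # 120 days of synthetic case counts
series = {city: 50 + 10 * i + 5 * np.sin(t / 7) + rng.normal(0, 3, t.size)
          for i, city in enumerate(["Riyadh", "Jeddah", "Makkah", "Dammam"])}
X = np.vstack(list(series.values()))

# PCA compresses each series to 2 components; K-means groups cities with similar dynamics
coords = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(dict(zip(series, labels)))

# ARIMA forecast for one city, 90 days ahead (first five forecast values shown)
fit = ARIMA(series["Riyadh"], order=(2, 1, 2)).fit()
print(fit.forecast(steps=90)[:5])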
Citations: 1
Combining BPSO and ELM Models for Inferring Novel lncRNA-Disease Associations
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-20 | DOI: 10.4018/ijdwm.317092
W. Yang, Xianghan Zheng, Qiongxia Huang, Yu Liu, Yimi Chen, ZhiGang Song
It is widely known that long non-coding RNA (lncRNA) plays an important role in gene expression and regulation. However, owing to several characteristics of lncRNA data (e.g., huge volume, high dimensionality, lack of labeled samples), identifying key lncRNAs closely related to a specific disease is very difficult. In this paper, the authors propose a computational method to predict key lncRNAs closely related to a corresponding disease. The proposed solution implements a BPSO-based intelligent algorithm to select candidate optimal lncRNA subsets and then uses an ML-ELM-based deep learning model to evaluate each subset. A wrapper feature extraction method is then used to select, from massive data, the lncRNAs closely related to the pathophysiology of the disease. Experimentation on three typical open datasets proves the feasibility and efficiency of the proposed solution, which achieves above 93% accuracy, the best result reported.
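To make the wrapper idea concrete, here is a small, hedged sketch of a binary PSO loop around an extreme learning machine fitness function (NumPy/scikit-learn); the particle count, inertia and acceleration constants, hidden-layer size, and synthetic data are illustrative assumptions, and this is not the authors' ML-ELM model.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def elm_accuracy(Xtr, ytr, Xte, yte, hidden=64, seed=0):
    """Extreme learning machine: random hidden layer + least-squares output weights."""
    rng = np.random.default_rng(seed)
    W, b = rng.normal(size=(Xtr.shape[1], hidden)), rng.normal(size=hidden)
    Htr, Hte = np.tanh(Xtr @ W + b), np.tanh(Xte @ W + b)
    beta = np.linalg.pinv(Htr) @ np.eye(2)[ytr]          # one-hot targets
    return ((Hte @ beta).argmax(1) == yte).mean()

def bpso_select(X, y, particles=8, iters=15, seed=0):
    rng = np.random.default_rng(seed)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    d = X.shape[1]

    def fitness(bits):
        sel = bits > 0.5
        return elm_accuracy(Xtr[:, sel], ytr, Xte[:, sel], yte) if sel.any() else 0.0

    pos = (rng.random((particles, d)) > 0.5).astype(float)
    vel = np.zeros((particles, d))
    pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((particles, d)), rng.random((particles, d))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = (rng.random((particles, d)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fit = np.array([fitness(p) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest > 0.5, pbest_fit.max()

X, y = make_classification(n_samples=300, n_features=40, n_informative=6, random_state=0)
mask, acc = bpso_select(X, y)
print(mask.sum(), "features selected, held-out accuracy", round(acc, 3))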
Citations: 1
Spatiotemporal Data Prediction Model Based on a Multi-Layer Attention Mechanism
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-16 | DOI: 10.4018/ijdwm.315822
Man Jiang, Qilong Han, Haitao Zhang, Hexiang Liu
Spatiotemporal data prediction is of great significance in fields such as smart cities and smart manufacturing. Current spatiotemporal data prediction models rely heavily on traditional spatial views or on a single temporal granularity and therefore miss knowledge such as dynamic spatial correlations, periodicity, and mutability. This paper addresses these challenges by proposing a multi-layer attention-based predictive model. The key idea is to use a multi-layer attention mechanism to model the dynamic spatial correlation of different features. Then, multi-granularity historical features are fused to predict future spatiotemporal data. Experiments on real-world data show that the proposed model outperforms six state-of-the-art benchmark methods.
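A hedged PyTorch sketch of the general pattern (not the paper's architecture): one attention layer models the correlation between features at each time step, and a second attends over fine-grained step representations together with coarse-grained window averages before a regression head predicts the next values. The model sizes, the 6-step coarse window, and the random input are assumptions.

import torch
import torch.nn as nn

class MultiLayerAttentionPredictor(nn.Module):
    """Two attention layers: one over the features at each time step (dynamic
    spatial correlation), one over fine- plus coarse-grained temporal summaries."""
    def __init__(self, n_features, d_model=32, window=6):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(1, d_model)
        self.spatial_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.head = nn.Linear(d_model, n_features)

    def forward(self, x):                                # x: (batch, time, features)
        b, t, f = x.shape                                # t must be divisible by window
        tokens = self.proj(x.reshape(b * t, f, 1))       # one token per feature per step
        spatial, _ = self.spatial_attn(tokens, tokens, tokens)
        step_repr = spatial.mean(dim=1).reshape(b, t, -1)
        coarse = step_repr.reshape(b, t // self.window, self.window, -1).mean(dim=2)
        seq = torch.cat([step_repr, coarse], dim=1)      # fuse two temporal granularities
        temporal, _ = self.temporal_attn(seq, seq, seq)
        return self.head(temporal.mean(dim=1))           # next-step value per feature

x = torch.randn(4, 24, 10)                               # 4 samples, 24 steps, 10 sensors
print(MultiLayerAttentionPredictor(10)(x).shape)         # torch.Size([4, 10])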
Citations: 0
A New Outlier Detection Algorithm Based on Fast Density Peak Clustering Outlier Factor
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316534
Zhongping Zhang, Sen Li, Weixiong Liu, Y. Wang, Daisy Xin Li
Outlier detection is an important field in data mining that can be applied to fraud detection, fault detection, and other areas. This article focuses on the problems that the density peak clustering algorithm requires manual parameter setting and has high time complexity. First, k-nearest-neighbor estimation replaces the density-peak density estimate, with a KD-Tree index structure used to compute the k nearest neighbors of each data object. Cluster centers are then selected automatically using the product of density and distance. In addition, the central relative distance and the fast density peak clustering outlier factor are defined to characterize the degree to which data objects are outliers. On this basis, an outlier detection algorithm is devised. Experiments on synthetic and real datasets validate the algorithm, and its validity and time efficiency are confirmed in comparison with several conventional and recent algorithms.
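A hedged sketch of the kind of score described above (not the paper's exact definitions): kNN distances from a KD-Tree give a density estimate, delta is the distance to the nearest denser point as in density peak clustering, and the ratio delta/density serves as the outlier factor. The choice k=10 and the synthetic data are assumptions.

import numpy as np
from sklearn.neighbors import KDTree

def density_peak_outlier_factor(X, k=10):
    tree = KDTree(X)
    dist, _ = tree.query(X, k=k + 1)                 # column 0 is the point itself
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    order = np.argsort(-density)                     # densest point first
    delta = np.zeros(len(X))
    for rank, i in enumerate(order[1:], start=1):
        denser = order[:rank]                        # all points with higher density
        delta[i] = np.linalg.norm(X[denser] - X[i], axis=1).min()
    delta[order[0]] = delta.max()                    # convention for the densest point
    return delta / density                           # large value => likely outlier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0], [9.0, -7.0]]])
scores = density_peak_outlier_factor(X)
print(np.argsort(-scores)[:2])                       # the two injected points should rank first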
Citations: 1
Estimating the Number of Clusters in High-Dimensional Large Datasets
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316142
Xutong Zhu, Lingli Li
Clustering is a basic primitive of exploratory tasks. To obtain valuable results, the number of clusters, a key parameter of the clustering algorithm, must be set appropriately. Existing methods for determining the number of clusters perform well on low-dimensional small datasets, but effectively determining the optimal number of clusters on large high-dimensional datasets remains a challenging problem. In this paper, the authors design a method that overcomes the shortcomings of existing estimation methods and accurately and quickly estimates the optimal number of clusters on large-scale high-dimensional datasets. Extensive experiments show that it (1) outperforms existing estimation methods in accuracy and efficiency, (2) generalizes across different datasets, and (3) is suitable for high-dimensional large datasets.
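The abstract does not describe the estimator itself, so the sketch below only shows a common baseline of the kind such work is compared against, not the authors' method: scan candidate k with MiniBatchKMeans on a random sample and keep the k with the best Calinski-Harabasz index. The sample size, k range, and synthetic blobs are assumptions.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def estimate_k(X, k_max=10, sample=5000, seed=0):
    """Baseline estimate of the number of clusters via an internal validity index."""
    rng = np.random.default_rng(seed)
    Xs = X[rng.choice(len(X), size=min(sample, len(X)), replace=False)]
    best_k, best_score = 2, -np.inf
    for k in range(2, k_max + 1):
        labels = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xs)
        score = calinski_harabasz_score(Xs, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

X, _ = make_blobs(n_samples=20000, n_features=50, centers=6, random_state=0)
print(estimate_k(X))    # expected to recover 6 on this synthetic data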
Citations: 0
An Ensemble Approach for Prediction of Cardiovascular Disease Using Meta Classifier Boosting Algorithms
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316145
Sibo Prasad Patro, Neelamadhab Padhy, Rahul Deo Sah
Very few studies have investigated the potential of hybrid ensemble machine learning techniques for building a model for the detection and prediction of heart disease in the human body. In this research, the authors address a classification problem by hybridizing a fusion-based ensemble model with machine learning approaches, which produces a more trustworthy ensemble than the original ensemble model and outperforms previous heart disease prediction models. The proposed model is evaluated on the Cleveland heart disease dataset using six boosting techniques: XGBoost, AdaBoost, Gradient Boosting, LightGBM, CatBoost, and Histogram-Based Gradient Boosting. Hybridization produces superior results among the classification algorithms considered, with the Meta-XGBoost classifier achieving remarkable accuracies of 96.51% for training and 93.37% for testing.
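A hedged sketch of a boosting-based stacking ensemble in scikit-learn. Since the Cleveland dataset is not bundled with scikit-learn and XGBoost, LightGBM, and CatBoost are external dependencies, the sketch substitutes built-in boosters and the breast-cancer dataset as stand-ins; it illustrates the meta-classifier pattern, not the authors' exact Meta-XGBoost model.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              HistGradientBoostingClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)           # stand-in for the Cleveland data
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("hgb", HistGradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),   # the meta classifier
    cv=5,
)
print("held-out accuracy:", stack.fit(Xtr, ytr).score(Xte, yte))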
Citations: 0
An Efficient Association Rule Mining-Based Spatial Keyword Index
IF 1.2 | CAS Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2023-01-13 | DOI: 10.4018/ijdwm.316161
Lianyin Jia, Haotian Tang, Mengjuan Li, Bingxin Zhao, S. Wei, Haihe Zhou
Spatial keyword queries have attracted the attention of many researchers. Most existing spatial keyword indexes do not consider differences in keyword distribution, so their efficiency is low when the data are skewed. To this end, this paper proposes a novel association-rule-mining-based spatial keyword index, ARM-SQ, whose inverted lists are materialized from the frequent item sets mined by association rules; thus, intersections of long lists can be avoided. To prevent excessive space costs caused by materialization, a depth-based materialization strategy is introduced that maintains a good balance between query and space costs. To select the right frequent item sets for answering a query, the authors further implement a benefit-based greedy frequent item set selection algorithm, BGF-Selection. The experimental results show that this algorithm significantly outperforms existing algorithms, and its efficiency can be an order of magnitude higher than that of SFC-Quad.
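A small, hedged sketch of the materialization idea (plain Python, not the ARM-SQ index): frequent keyword pairs above a support threshold get their own precomputed lists, and a query prefers those lists so that intersections of long single-keyword lists are avoided. The toy objects, the support threshold of 2, and the omission of any spatial pruning are illustrative simplifications.

from collections import defaultdict
from itertools import combinations

# toy spatial objects: (id, (x, y), keywords); spatial filtering is omitted here
objects = [
    (1, (1.0, 2.0), {"coffee", "wifi"}),
    (2, (1.2, 2.1), {"coffee", "wifi", "cake"}),
    (3, (5.0, 5.5), {"pizza", "wifi"}),
    (4, (0.9, 1.8), {"coffee", "cake"}),
    (5, (5.1, 5.4), {"pizza", "beer"}),
]

# single-keyword inverted lists
inverted = defaultdict(set)
for oid, _, kws in objects:
    for kw in kws:
        inverted[kw].add(oid)

# materialise lists for frequent keyword pairs (support >= 2), mimicking itemsets mined by association rules
min_support = 2
pair_lists = {}
for oid, _, kws in objects:
    for pair in combinations(sorted(kws), 2):
        pair_lists.setdefault(pair, set()).add(oid)
pair_lists = {p: ids for p, ids in pair_lists.items() if len(ids) >= min_support}

def query(keywords):
    """Answer a keyword query, preferring materialised pair lists over single lists."""
    keywords = sorted(keywords)
    result, covered = None, set()
    for pair in combinations(keywords, 2):
        if pair in pair_lists:
            result = pair_lists[pair] if result is None else result & pair_lists[pair]
            covered.update(pair)
    for kw in keywords:
        if kw not in covered:
            result = inverted[kw] if result is None else result & inverted[kw]
    return result or set()

print(query({"coffee", "wifi"}))        # {1, 2} via the materialised pair list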
Citations: 1