Pub Date: 2025-10-03 | DOI: 10.1109/TKDE.2025.3617894
Hao Wu;Qu Wang;Xin Luo;Zidong Wang
A nonstandard tensor is frequently adopted to model a large-scale complex dynamic network. A Tensor Representation Learning (TRL) model enables extracting valuable knowledge from a dynamic network via learning a low-dimensional representation of a target nonstandard tensor. Nevertheless, the representation learning ability of existing TRL models is limited for a nonstandard tensor, due to their inability to accurately represent its specific nature, i.e., mode imbalance, high dimensionality, and incompleteness. To address this issue, this study proposes a Mode-Aware Tucker Network-based Tensor Representation Learning (MTN-TRL) model with three-fold ideas: a) designing a mode-aware Tucker network (MTN) to accurately represent the imbalanced modes of a nonstandard tensor, b) building an MTN-based, highly efficient TRL model that fuses a data density-oriented modeling principle with an adaptive parameter learning scheme, and c) theoretically proving the MTN-TRL model's convergence. Extensive experiments on eight nonstandard tensors generated from real-world dynamic networks demonstrate that MTN-TRL significantly outperforms state-of-the-art models in terms of representation accuracy.
Title: Learning Accurate Representation to Nonstandard Tensors via a Mode-Aware Tucker Network
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7272-7285, 2025.
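The Tucker form underlying the model can be made concrete with a small sketch: reconstruct a tensor from a core tensor and per-mode factor matrices whose ranks are deliberately imbalanced across modes, and score the fit only over observed entries. The sizes, ranks, and observation mask below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a nonstandard tensor with imbalanced modes:
# mode 3 is much longer than modes 1 and 2, so it gets a smaller rank.
I, J, K = 50, 50, 500          # tensor dimensions
r1, r2, r3 = 5, 5, 2           # mode-aware ranks

G = rng.standard_normal((r1, r2, r3))   # core tensor
A = rng.standard_normal((I, r1))        # factor matrix, mode 1
B = rng.standard_normal((J, r2))        # factor matrix, mode 2
C = rng.standard_normal((K, r3))        # factor matrix, mode 3

# Tucker reconstruction: X_hat[i,j,k] = sum_{p,q,r} G[p,q,r] A[i,p] B[j,q] C[k,r]
X_hat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# With incomplete data, a density-oriented loss sums errors only over
# the observed entries (mask M); here ~1% of entries are "observed".
M = rng.random((I, J, K)) < 0.01
X_obs = rng.standard_normal((I, J, K))  # stand-in for observed values
loss = np.sum(M * (X_obs - X_hat) ** 2)
```

The einsum contraction is the whole Tucker model in one line; choosing ranks per mode is what makes the factorization "mode-aware" in this toy sense.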
Pub Date: 2025-10-03 | DOI: 10.1109/TKDE.2025.3617461
Zhouyang Liu;Yixin Chen;Ning Liu;Jiezhong He;Dongsheng Li
Graph similarity is critical in graph-related tasks such as graph retrieval, where metrics like maximum common subgraph (MCS) and graph edit distance (GED) are commonly used. However, exact computation of these metrics is known to be NP-hard. Recent neural network-based approaches approximate the similarity score in embedding spaces to alleviate the computational burden, but they either involve expensive pairwise node comparisons or fail to effectively utilize the structural and scale information of graphs. To tackle these issues, we propose a novel geometry-based graph embedding method called Graph2Region (G2R). G2R represents nodes as closed regions and recovers their adjacency patterns within graphs in the embedding space. By incorporating the node features and adjacency patterns of graphs, G2R summarizes graph regions, i.e., graph embeddings, where the shape captures the underlying graph structures and the volume reflects the graph size. Consequently, the overlap between graph regions can serve as an approximation of MCS, signifying similar node regions and adjacency patterns. We further analyze the relationship between MCS and GED and propose using the disjoint parts as a proxy for GED similarity. This analysis enables concurrent computation of MCS and GED, incorporating local and global structural information. Experimental evaluation highlights G2R's competitive performance in graph similarity computation. It achieves up to a 60.0% relative accuracy improvement over state-of-the-art methods in MCS similarity learning, while maintaining efficiency in both training and inference. Moreover, G2R showcases remarkable capability in predicting both MCS and GED similarities simultaneously, providing a holistic assessment of graph similarity.
Title: Graph2Region: Efficient Graph Similarity Learning With Structure and Scale Restoration
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7213-7225, 2025.
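The region-overlap intuition can be pictured with axis-aligned boxes: intersection volume stands in for MCS similarity, and the disjoint remainder for a GED-style dissimilarity. This is a toy geometric sketch, not G2R's learned regions; the boxes and the normalization are assumptions.

```python
import numpy as np

def box_volume(lo, hi):
    """Volume of an axis-aligned box; zero if degenerate."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def overlap_volume(lo_a, hi_a, lo_b, hi_b):
    """Volume of the intersection of two axis-aligned boxes."""
    return box_volume(np.maximum(lo_a, lo_b), np.minimum(hi_a, hi_b))

# Two toy "graph regions" in a 3-d embedding space.
lo_a, hi_a = np.array([0., 0., 0.]), np.array([2., 2., 2.])   # volume 8
lo_b, hi_b = np.array([1., 1., 1.]), np.array([3., 3., 3.])   # volume 8

inter = overlap_volume(lo_a, hi_a, lo_b, hi_b)   # unit cube -> 1.0
vol_a = box_volume(lo_a, hi_a)
vol_b = box_volume(lo_b, hi_b)

# Overlap stands in for MCS; the disjoint parts stand in for a GED-style
# proxy (larger disjoint volume -> more "edits" to reconcile the graphs).
mcs_sim = inter / min(vol_a, vol_b)
ged_proxy = (vol_a - inter) + (vol_b - inter)
```

Because both quantities come from the same pair of regions, MCS-style and GED-style scores fall out of one embedding, which mirrors the concurrent-computation claim above.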
Data privacy protection legislation around the world has increasingly enforced the "right to be forgotten" regulation, generating a surge in research interest in machine unlearning (MU), which aims to remove the impact of training data from machine learning models upon receiving revocation requests from data owners. There exist two major challenges for the performance of MU: execution efficiency and inference interference. The former requires minimizing the computational overhead of each execution of the MU mechanism, while the latter calls for reducing the execution frequency to minimize interference with normal inference services. Most existing MU studies focus on the sample-level unlearning setting, leaving the other paramount feature-level setting under-explored, and adapting these existing techniques to the latter turns out to be non-trivial. The only known feature-level work achieves an approximate unlearning guarantee, but suffers from degraded model accuracy and still leaves the inference interference challenge unsolved. We are therefore motivated to propose FELEMN, the first FEature-Level Exact Machine uNlearning method that overcomes both of the above-mentioned hurdles. For the MU execution efficiency challenge, we explore the impact of different feature partitioning strategies on the preservation of semantic relationships, so as to maintain model accuracy and MU efficiency. For the inference interference challenge, we propose two batching mechanisms that combine as many individual unlearning requests as possible for joint processing, while avoiding the potential privacy issues that come with wrongly postponing unlearning requests, grounded in theoretical analysis. Experiments on five real datasets show that our FELEMN outperforms up-to-date competitors with up to 3× speedup for each MU execution, and a 50% runtime reduction by mitigating inference interference.
Title: FELEMN: Toward Efficient Feature-Level Machine Unlearning for Exact Privacy Protection
Pub Date: 2025-09-23 | DOI: 10.1109/TKDE.2025.3613659
Zhigang Wang;Yizhen Yu;Mingxin Li;Jian Lou;Ning Wang;Yu Gu;Shen Su;Yuan Liu;Hui Jiang;Zhihong Tian
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7169-7183, 2025.
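The feature-partitioning idea behind exact unlearning can be sketched with a stage-wise additive ensemble of least-squares submodels, one per feature partition: revoking a feature forces retraining only of the partition that owns it and the partitions fitted after it, while earlier partitions are reused untouched. This toy linear setup is a hypothetical illustration, not FELEMN's actual partitioning strategy or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 8 features, linear target.
X = rng.standard_normal((200, 8))
y = X @ rng.standard_normal(8) + 0.01 * rng.standard_normal(200)

# Each partition owns a disjoint set of feature columns.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7]]

def fit_partition(cols, residual):
    """Least-squares fit of one feature partition against a residual."""
    w, *_ = np.linalg.lstsq(X[:, cols], residual, rcond=None)
    return w

# Stage-wise training: each partition models what earlier ones missed.
weights, residual = [], y.copy()
for cols in partitions:
    w = fit_partition(cols, residual)
    weights.append(w)
    residual = residual - X[:, cols] @ w

# Exactly "unlearn" feature 4: zero the revoked column, keep partition 0's
# weights as-is, and retrain only partition 1 and everything after it.
X[:, 4] = 0.0
residual = y - X[:, partitions[0]] @ weights[0]
for i, cols in enumerate(partitions[1:], start=1):
    weights[i] = fit_partition(cols, residual)
    residual = residual - X[:, cols] @ weights[i]
```

The retraining cost scales with the partitions downstream of the revoked feature, which is why the choice of partitioning strategy matters for both accuracy and unlearning efficiency.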
Existing cold-start recommendation methods typically use item-level alignment strategies to align the content features and collaborative features of warm items during model training. However, these methods are less effective for cold items with low semantic similarity to the warm items when they first appear in the test stage, as such items have no historical interactions from which to obtain collaborative features. In this paper, we propose a preference-aware recommendation (PARec) model with hierarchical item alignment to solve the item cold-start issue. Our approach exploits user preferences from historical records to achieve group-level alignment with item content features, enhancing recommendation performance. Specifically, our hierarchical item alignment strategy improves recommendations for both high- and low-similarity cold items by using item-level alignment for high-similarity cold items and introducing group-level alignment for low-similarity cold items. Low-similarity cold items can be successfully recommended through relationships among items, captured by our group-level alignment, based on their co-occurrence possibilities and semantic similarities. For model training, a hierarchical contrastive objective function is presented to balance the performance of warm and cold items, achieving better overall performance. Extensive experiments demonstrate the effectiveness of our method, with results showing its superiority over state-of-the-art approaches.
Title: Preference Aware Item Cold-Start Recommendation With Hierarchical Item Alignment
Pub Date: 2025-09-23 | DOI: 10.1109/TKDE.2025.3613263
Wenbo Wang;Ben Chen;Bingquan Liu;Lili Shan;Chengjie Sun;Qian Chen;Feiyang Xiao;Jian Guan
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7388-7401, 2025.
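A minimal picture of group-level alignment: cluster warm items by content, average their collaborative embeddings per group, and hand a cold item the collaborative centroid of its nearest content group. The clustering scheme, group count, and embedding sizes below are toy assumptions standing in for PARec's learned alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: warm items have both content and collaborative
# vectors; cold items only have content vectors.
content_warm = rng.standard_normal((100, 16))
collab_warm = rng.standard_normal((100, 16))
content_cold = rng.standard_normal((5, 16))

# Crude grouping: assign each warm item to its nearest of a few
# randomly chosen content centroids (a stand-in for preference groups).
n_groups = 8
centroids = content_warm[rng.choice(100, n_groups, replace=False)]
assign = np.argmin(
    np.linalg.norm(content_warm[:, None, :] - centroids[None], axis=-1), axis=1)

# Group-level collaborative signature: mean collaborative embedding per group.
group_collab = np.stack([
    collab_warm[assign == g].mean(axis=0) if np.any(assign == g)
    else np.zeros(16)
    for g in range(n_groups)])

# A cold item borrows the collaborative centroid of its nearest group.
cold_groups = np.argmin(
    np.linalg.norm(content_cold[:, None, :] - centroids[None], axis=-1), axis=1)
collab_cold = group_collab[cold_groups]
```

The point of the group level is visible here: even a cold item far from any individual warm item still lands in some group and inherits a usable collaborative signal.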
Visual Question Answering (VQA), aimed at improving AI-driven interactions and solving complex visual-linguistic tasks, has increasingly garnered attention as a pivotal research domain in both academic and industrial spheres. Despite progress in VQA, current studies still suffer from the challenge of language bias posed by spurious semantic correlations and minority class collapse, leading to semantic ambiguities and distribution shifts that hinder robust performance in challenging scenarios. To address these challenges, we propose a robust multi-space collaborative debiasing paradigm, termed "LBF-VQA", which systematically leverages multi-space collaborative debiasing strategies to achieve language bias-free VQA, encompassing both Euclidean space debiasing (ESD) and Spherical space debiasing (SSD). By strategically introducing bias-examples and their corresponding counter-examples, the ESD strategy focuses on uncovering hidden prior correlations and the complex interactions between modality and semantics within the Euclidean space. Benefiting from the infinite contrastive and distribution debiasing learning mechanisms, the SSD strategy effectively prevents the collapse of minority classes while enhancing the manifold representations of instance de-biasing and distribution de-dependence in the Spherical space. Furthermore, we meticulously constructed a specialized medical dataset intentionally embedded with deliberate language bias to comprehensively examine the negative effects of language bias on medical VQA systems. Extensive experiments on multiple general and medical VQA benchmarks consistently verify that our LBF-VQA handles various complex VQA scenarios more effectively than state-of-the-art baselines, demonstrating its generalizability.
Title: LBF-VQA: Towards Language Bias-Free Visual Question Answering With Multi-Space Collaborative Debiasing
Pub Date: 2025-09-23 | DOI: 10.1109/TKDE.2025.3613421
Yishu Liu;Huanjia Zhu;Bingzhi Chen;Xiaozhao Fang;Guangming Lu;Shengli Xie
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7255-7271, 2025.
Pub Date: 2025-09-22 | DOI: 10.1109/TKDE.2025.3607765
Yaming Yang;Zhe Wang;Ziyu Guan;Wei Zhao;Xinyan Huang;Xiaofei He
Entity Alignment (EA) aims to link potentially equivalent entities across different knowledge graphs (KGs). Most existing EA methods are supervised, as they require the supervision of seed alignments, i.e., manually specified aligned entity pairs. Very recently, several EA studies have attempted to dispense with seed alignments. Despite achieving preliminary progress, they still suffer from two limitations: (1) the entity embeddings produced by their GNN-like encoders lack personalization, since some of the aggregation subpaths are shared between different entities; and (2) they cannot fully alleviate the distribution distortion issue between candidate KGs due to the absence of supervised signals. In this work, we propose a novel unsupervised entity alignment approach called UNEA to address the above two issues. First, we parametrically sample a tree neighborhood rooted at each entity, and accordingly develop a tree attention aggregation mechanism to extract a personalized embedding for each entity. Second, we introduce an auxiliary task of maximizing the mutual information between the input and the output of the KG encoder, which serves as a regularization to prevent distribution distortion. Extensive experiments show that our UNEA achieves a new state-of-the-art for the unsupervised EA task, and can even outperform many existing supervised EA baselines.
Title: Unsupervised Entity Alignment Based on Personalized Discriminative Rooted Tree
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7440-7452, 2025.
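The root-specific sampling can be pictured as follows: grow a bounded-width tree outward from each entity, so that no aggregation path is shared across different roots. The toy adjacency, branching factor, and depth below are assumptions; UNEA's sampling is parametric and learned, and this sketch only conveys the shape of the neighborhood.

```python
import random

random.seed(0)

# Toy KG adjacency: entity id -> neighbor entity ids.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 4, 5], 3: [0], 4: [1, 2], 5: [2]}

def sample_tree(root, branch=2, depth=2):
    """Sample a neighborhood tree rooted at `root`: each node keeps at
    most `branch` unvisited neighbors, down to `depth` levels. Growing
    the tree per root is what makes the resulting aggregation paths
    personalized to that root."""
    tree = {root: []}
    visited = {root}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            cands = [n for n in adj.get(node, []) if n not in visited]
            picked = random.sample(cands, min(branch, len(cands)))
            tree[node] = picked
            for c in picked:
                visited.add(c)
                tree.setdefault(c, [])
                next_frontier.append(c)
        frontier = next_frontier
    return tree

tree = sample_tree(0)
```

An attention-weighted bottom-up aggregation over such a tree would then yield one embedding per root, with no subpath reused between roots.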
Pub Date: 2025-09-22 | DOI: 10.1109/TKDE.2025.3613148
Hao Huang;Mingxin Wang;Mengqi Shan;Zhigao Zheng;Ting Gan;Jiawei Jiang;Zongpeng Li
Outdoor billboard advertising has proven effective for commercial promotions, attracting potential customers, and boosting product sales. Auctions serve as a popular method for leasing billboard usage rights, enabling a seller to rent billboards to winning users for predefined periods according to their bids. An effective auction algorithm is of great significance to maximize the efficiency of the billboard ecosystem. In contrast to a rich literature on Internet advertising auctions, well-crafted algorithms tailored for outdoor billboard auctions remain rare. In this work, we investigate the problem of outdoor billboard auctions in the practical setting where bids are received and processed on the fly. Our goal is to maximize social welfare, namely the total benefit of the auction participants, including the billboard service provider and the bidding users. To this end, we first formulate the billboard social welfare maximization problem as an Integer Linear Program (ILP), and then reformulate the ILP into a compact form with a reduced number of constraints (at the cost of involving exponentially many primal variables), based on which we derive the dual problem. Furthermore, we design a dual oracle to handle the exponentially many dual constraints, avoiding exhaustive enumeration. We present a primal-dual online algorithm with an incentive-compatible pricing mechanism. Theoretical analysis proves the individual rationality, incentive compatibility, and computational efficiency of our online algorithm. Extensive experimental results show that the online algorithm is both effective and efficient, and achieves a good competitive ratio.
Title: Online Billboard Auction With Social Welfare Maximization
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7362-7373, 2025.
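The on-the-fly flavor of such a mechanism can be caricatured with posted prices: each slot's price rises multiplicatively as its capacity fills, and because the price a bidder faces never depends on its own bid, truthful bidding is a dominant strategy in this toy. The capacities, bid stream, and doubling rule are illustrative assumptions, not the paper's primal-dual algorithm.

```python
# Online posted-price allocation over billboard slots (a caricature).
capacity = {"slot_a": 2, "slot_b": 1}    # rentable periods per slot
price = {s: 1.0 for s in capacity}       # current posted price per slot
sold = {s: 0 for s in capacity}
welfare = 0.0

# Bids arrive on the fly as (slot, value) pairs and are decided immediately.
bids = [("slot_a", 3.0), ("slot_b", 0.5), ("slot_a", 2.0),
        ("slot_b", 4.0), ("slot_a", 5.0)]

for slot, bid in bids:
    if sold[slot] < capacity[slot] and bid >= price[slot]:
        sold[slot] += 1
        welfare += bid                   # winner's value adds to welfare
        price[slot] *= 2.0               # dual-style multiplicative update
```

The multiplicative update plays the role of the dual variable: scarce slots get expensive quickly, reserving remaining capacity for high-value bids.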
The demand for more precise and timely urban resource allocation and management has driven the extension of urban flow prediction from short-term to long-term horizons. As the time scale expands, the issue of urban flow distribution shift becomes increasingly prominent due to various impact factors, such as weather, events, city changes, etc. Traditionally, comprehensively analyzing and addressing the causal relationships underlying the distribution shift caused by these factors has been challenging. In this paper, we propose that these impact factors can be partitioned into two major types, i.e., context factors and structural factors. We then present a decomposition-based model for long-term urban flow prediction from a causal perspective, named DeCau, which can discriminate between the two types of factors to effectively solve the problem of urban flow distribution shift. First, we employ a decomposition module to decompose urban flow into a seasonal part and a trend part. The seasonal part contains high-frequency irregular variations caused by context factors. We devise a shared distribution estimator to approximate the unavailable prior distributions of context factors, and then apply causal intervention to mitigate their confounding impact. The distribution shift in the trend part is induced by structural factors. We design a dual causal dependency extractor to model the causality between POI distribution and urban flow, and then eliminate spurious correlations through causal adjustment. Finally, we design an end-to-end framework for long-term urban flow prediction by combining the embeddings from the two parts, enabling the model to generalize to unseen distributions. Extensive experimental results demonstrate that DeCau outperforms state-of-the-art baselines.
{"title":"Long-Term Urban Flow Prediction Against Data Distribution Shift: A Causal Perspective","authors":"Yuting Liu;Qiang Zhou;Hanzhe Li;Fuzhen Zhuang;Jingjing Gu","doi":"10.1109/TKDE.2025.3612033","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3612033","url":null,"abstract":"The demand for more precise and timely urban resource allocation and management has driven the extension of urban flow prediction from short-term to long-term horizons. As the time scale expands, the issue of urban flow distribution shift becomes increasingly prominent due to various impact factors, such as weather, events, city changes, etc. Traditionally, comprehensively analyzing and addressing the causal relationships underlying the distribution shift caused by these factors has been challenging. In this paper, we propose that these impact factors can be partitioned in two major types, i.e., context factors and structural factors. We then present a decomposition-based model for long-term urban flow prediction from a causal perspective, named <italic>DeCau</i>, which can discriminate between the two types of factors for effectively solving the problem of urban flow distribution shift. First, we employ a decomposition module to decompose urban flow into seasonal part and trend part. The seasonal part contains high frequency irregular variations caused by context factors. We advise a shared distribution estimator to approximate the unavailable prior distributions of context factors, and then apply causal intervention to mitigate the confounding impact of context factors. The distribution shift in the trend part is induced by structural factors. We design a dual causal dependency extractor to model the causality between POIs distribution and urban flow, and then eliminate spurious correlations through causal adjustment. 
Finally, we design an end-to-end framework for long-term urban flow prediction by combining the embeddings from the two parts, enabling the model to generalize to unseen distributions. Extensive experimental results demonstrate that <italic>DeCau</italic> outperforms state-of-the-art baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7286-7299"},"PeriodicalIF":10.4,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
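The decomposition step described in the DeCau abstract above, splitting urban flow into a seasonal part and a trend part, can be illustrated with a classical centered moving-average decomposition. This is a generic sketch only: the paper's learned decomposition module, window size, and variable names are not specified in the abstract and are assumed here for illustration.

```python
from statistics import mean

def decompose(series, window=5):
    """Split a series into a trend part (centered moving average) and a
    seasonal/residual part (what remains after removing the trend)."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        # Average over a window centered at i, clipped at the boundaries.
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(mean(series[lo:hi]))
    # The high-frequency irregular part is the series minus the trend.
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

# Toy flow series with a repeating spike every 4th step.
flow = [10, 12, 9, 30, 11, 13, 10, 31, 12, 14, 9, 32]
trend, seasonal = decompose(flow)
```

By construction the two parts sum back to the original series, which is the property the end-to-end framework relies on when it recombines the two embeddings.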
Bipartite graphs are widely used in many real-world applications, where discovering clusters is crucial for understanding their underlying structure. However, most existing clustering methods for bipartite graphs enforce the assignment of all vertices to clusters, often neglecting the important roles of outliers and hubs. To address this limitation, we plan to extend the structural clustering model from unipartite to bipartite graphs. This extension is non-trivial due to the lack of common neighbors in bipartite graphs, which renders traditional similarity measures less effective. Recognizing that similarity is key to structural clustering, we resort to butterflies—the fundamental building blocks of bipartite graphs—to define a more effective similarity measure. Building on this, we further propose a novel structural clustering model, ${\mathsf{SBC}}$, tailored for bipartite graphs. To enable clustering under this model, we develop efficient online and index-based methods, along with a dynamic maintenance method to accommodate graph updates over time. Extensive experiments on real-world bipartite graphs demonstrate that: (1) The ${\mathsf{SBC}}$ model greatly enhances clustering quality, achieving higher modularity while effectively identifying outliers and hubs. (2) Our proposed clustering methods are highly scalable, enabling the processing of graphs with up to 12.2 million edges within 2 seconds.
{"title":"Structural Clustering for Bipartite Graphs","authors":"Mingyu Yang;Wentao Li;Wei Wang;Dong Wen;Min Gao;Lu Qin","doi":"10.1109/TKDE.2025.3612290","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3612290","url":null,"abstract":"Bipartite graphs are widely used in many real-world applications, where discovering clusters is crucial for understanding their underlying structure. However, most existing clustering methods for bipartite graphs enforce the assignment of <i>all</i> vertices to clusters, often neglecting the important roles of outliers and hubs. To address this limitation, we plan to extend the structural clustering model from unipartite to bipartite graphs. This extension is non-trivial due to the lack of common neighbors in bipartite graphs, which renders traditional similarity measures less effective. Recognizing that similarity is key to structural clustering, we resort to butterflies—the fundamental building blocks of bipartite graphs—to define a more effective similarity measure. Building on this, we further propose a novel structural clustering model, <inline-formula><tex-math>${\mathsf{SBC}}$</tex-math></inline-formula>, tailored for bipartite graphs. To enable clustering under this model, we develop efficient online and index-based methods, along with a dynamic maintenance method to accommodate graph updates over time. Extensive experiments on real-world bipartite graphs demonstrate that: (1) The <inline-formula><tex-math>${\mathsf{SBC}}$</tex-math></inline-formula> model greatly enhances clustering quality, achieving higher modularity while effectively identifying outliers and hubs. 
(2) Our proposed clustering methods are highly scalable, enabling the processing of graphs with up to 12.2 million edges within 2 seconds.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"645-658"},"PeriodicalIF":10.4,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
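The abstract above builds its similarity measure on butterflies, the (2,2)-bicliques of a bipartite graph. A minimal sketch of the counting primitive follows: the number of butterflies containing a pair of same-side vertices is the number of ways to pick two of their common neighbors. The actual ${\mathsf{SBC}}$ similarity formula is not given in the abstract; the `adj` layout and function name here are illustrative assumptions.

```python
from math import comb

def butterflies_between(adj, u, v):
    """Count the butterflies ((2,2)-bicliques) containing same-side
    vertices u and v: any 2 of their common neighbors complete one."""
    common = len(adj[u] & adj[v])
    return comb(common, 2)

# Toy bipartite graph: upper-side vertices mapped to lower-side neighbor sets.
adj = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"c"},
}
```

Here `u1` and `u2` share three neighbors, giving three butterflies, while `u1` and `u3` share only one and so form none — exactly the kind of signal a butterfly-based similarity can use where common-neighbor measures fail.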
Recently, large language models (LLMs) have made remarkable progress in table understanding, yet they remain vulnerable to structural noise (SN) and textual noise (TN). Existing methods usually employ biased denoising strategies, such as structural matching and textual filtering, or overzealous denoising strategies, such as introducing supplementary tasks like text-to-SQL and table-to-text, to reduce these two types of noise. However, these methods either neglect one type of noise or introduce substantial external noise. Therefore, how to simultaneously mitigate structural and textual noise without introducing extra noise, and thereby improve the performance of LLMs in table understanding, remains an unresolved issue. In this paper, we rethink the bottlenecks in table understanding from the perspective of noise reduction and propose a novel dual-denoiser-reasoner model, called TabDDR, for balanced and effective denoising. Specifically, our model consists of a structural-and-textual denoiser and a task-adaptive reasoner. The former removes the two types of noise via triplet alignment and planning extraction to seek an interpretable balance between breaking structural barriers and preserving structural characteristics, eliminating textual noise while retaining maximal information; the latter provides a simple but effective reasoning process that can adapt to various downstream tasks. To highlight the presence and impact of structural and textual noise, we construct the WTQ-SN and WTQ-TN datasets based on the WikiTableQuestions (WTQ) dataset. Extensive experiments on these self-constructed datasets and two other public datasets demonstrate that our proposed method performs better than state-of-the-art baselines.
{"title":"Toward Balanced Denoising: Building a Structural and Textual Denoiser for Table Understanding","authors":"Shu-Xun Yang;Xian-Ling Mao;Yu-Ming Shang;Heyan Huang","doi":"10.1109/TKDE.2025.3612217","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3612217","url":null,"abstract":"Recently, large language models (LLMs) have made remarkable progress in table understanding, yet they remain vulnerable to structural noise (SN) and textual noise (TN). Existing methods usually employ biased denoising strategies, such as structural matching and textual filtering, or overzealous denoising strategies, such as introducing supplementary tasks like text-to-SQL and table-to-text, to reduce these two types of noise. However, these methods either neglect one type of noise or introduce substantial external noise. Therefore, how to simultaneously mitigate structural and textual noise without introducing extra noise, and thereby improve the performance of LLMs in table understanding, remains an unresolved issue. In this paper, we rethink the bottlenecks in table understanding from the perspective of noise reduction and propose a novel dual-denoiser-reasoner model, called TabDDR, for balanced and effective denoising. Specifically, our model consists of a structural-and-textual denoiser and a task-adaptive reasoner. The former removes the two types of noise via triplet alignment and planning extraction to seek an interpretable balance between breaking structural barriers and preserving structural characteristics, eliminating textual noise while retaining maximal information; the latter provides a simple but effective reasoning process that can adapt to various downstream tasks. To highlight the presence and impact of structural and textual noise, we construct the WTQ-SN and WTQ-TN datasets based on the WikiTableQuestions (WTQ) dataset. 
Extensive experiments on these self-constructed datasets and two other public datasets demonstrate that our proposed method performs better than state-of-the-art baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7414-7425"},"PeriodicalIF":10.4,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145456048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
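The "triplet alignment" in the TabDDR abstract above is described only at a high level. One common way to break a table's structural barriers while preserving each cell's row/column provenance is to linearize it into (row index, column name, value) triplets; the sketch below shows that generic linearization under stated assumptions, not the paper's actual procedure (the function name and triplet layout are hypothetical).

```python
def table_to_triplets(header, rows):
    """Flatten a table into (row_index, column_name, value) triplets so
    cell contents can be consumed as text without losing their position."""
    return [
        (i, col, row[j])                 # one triplet per cell
        for i, row in enumerate(rows)    # row provenance
        for j, col in enumerate(header)  # column provenance
    ]

# Toy table in header/rows form.
header = ["city", "year", "flow"]
rows = [["Paris", 2024, 120], ["Rome", 2024, 95]]
triplets = table_to_triplets(header, rows)
```

Because every triplet carries its row and column, the structure can be reconstructed exactly, which is the kind of interpretable balance between flattening and preserving structure that the abstract argues for.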