Range control-based class imbalance and optimized granular elastic net regression feature selection for credit risk assessment
Pub Date: 2024-04-16 | DOI: 10.1007/s10115-024-02103-9
Vadipina Amarnadh, Nageswara Rao Moparthi
Credit risk, stemming from the failure of a contractual party to meet its obligations, is a significant concern for financial institutions. Assessing credit risk involves evaluating the creditworthiness of individuals, businesses, or other entities to predict the likelihood of default on financial obligations. While financial institutions categorize consumers based on creditworthiness, there is no universally defined set of attributes or indices for doing so. This research proposes Range control-based class imbalance and Optimized Granular Elastic Net regression (ROGENet) for feature selection in credit risk assessment. The dataset exhibits severe class imbalance, which is addressed using the Range-Controlled Synthetic Minority Oversampling TEchnique (RCSMOTE). The balanced data then undergo Granular Elastic Net regression with hybrid Gazelle sand cat Swarm Optimization (GENGSO) for feature selection. Elastic net, which ensures sparsity and grouping of correlated features, proves beneficial for assessing credit risk. ROGENet provides a detailed perspective on credit risk evaluation, surpassing conventional methods. The oversampling-based feature selection improves minority-class accuracy, reaching 99.4%, 99%, 98.6% and 97.3%, respectively.
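A minimal sketch of the general pipeline this abstract describes — oversample the minority class, then keep the features whose elastic-net coefficients survive the sparsity penalty. Standard SMOTE and a logistic model with an elastic-net penalty stand in for the paper's RCSMOTE and GENGSO-tuned granular elastic net; the toy dataset and all thresholds are illustrative assumptions, not the authors' settings.

```python
# Sketch: rebalance with SMOTE, then select features via elastic-net coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 5% minority class.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)

# Step 1: rebalance the classes (RCSMOTE would additionally control the sampling range).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Step 2: elastic-net penalty; l1_ratio mixes L1 sparsity with L2 grouping of correlated features.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.1, max_iter=5000)
clf.fit(X_bal, y_bal)

# Step 3: keep the features whose coefficients are not shrunk to (near) zero.
selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-3)
print("selected feature indices:", selected)
```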
{"title":"Range control-based class imbalance and optimized granular elastic net regression feature selection for credit risk assessment","authors":"Vadipina Amarnadh, Nageswara Rao Moparthi","doi":"10.1007/s10115-024-02103-9","DOIUrl":"https://doi.org/10.1007/s10115-024-02103-9","url":null,"abstract":"<p>Credit risk, stemming from the failure of a contractual party, is a significant variable in financial institutions. Assessing credit risk involves evaluating the creditworthiness of individuals, businesses, or entities to predict the likelihood of defaulting on financial obligations. While financial institutions categorize consumers based on creditworthiness, there is no universally defined set of attributes or indices. This research proposes Range control-based class imbalance and Optimized Granular Elastic Net regression (ROGENet) for feature selection in credit risk assessment. The dataset exhibits severe class imbalance, addressed using Range-Controlled Synthetic Minority Oversampling TEchnique (RCSMOTE). The balanced data undergo Granular Elastic Net regression with hybrid Gazelle sand cat Swarm Optimization (GENGSO) for feature selection. Elastic net, ensuring sparsity and grouping for correlated features, proves beneficial for assessing credit risk. ROGENet provides a detailed perspective on credit risk evaluation, surpassing conventional methods. The oversampling feature selection enhances the accuracy of minority class by 99.4, 99, 98.6 and 97.3%, respectively.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"34 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Argumentation-based multi-agent distributed reasoning in dynamic and open environments
Pub Date: 2024-04-15 | DOI: 10.1007/s10115-024-02101-x
Helio Monte-Alto, Mariela Morveli-Espinoza, Cesar Tacla
This work presents an approach for distributed and contextualized reasoning in multi-agent systems, considering environments in which agents may have incomplete, uncertain and inconsistent knowledge. Knowledge is represented by defeasible logic with mapping rules, which model the capability of agents to acquire knowledge from other agents during reasoning. Based on this knowledge representation, an argumentation-based reasoning model is proposed that enables the distributed building of reusable argument structures to support conclusions. Conflicts between arguments are resolved by an argument strength calculation that considers the trust among agents and the degree of similarity between the knowledge of different agents, based on the intuition that greater similarity between knowledge defined by different agents implies less uncertainty about the validity of the built argument. Contextualized reasoning is supported by having an agent share relevant knowledge when issuing queries to other agents, which enables the cooperating agents to be aware of knowledge not known a priori but important for reaching a reasonable conclusion given the context of the querying agent. A distributed algorithm is presented and evaluated both analytically and experimentally, confirming its computational feasibility. Finally, our approach is compared to related work, highlighting its contributions, demonstrating its applicability in a broader range of scenarios, and presenting perspectives for future work.
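A toy illustration of the kind of strength score the abstract describes — combining inter-agent trust with the similarity of the agents' knowledge. The concrete formula below (Jaccard similarity of shared rules, weighted by trust) is purely an assumption for illustration, not the paper's actual definition.

```python
# Toy argument-strength score: higher knowledge similarity -> less uncertainty
# about the jointly built argument -> higher strength (capped by trust).
def argument_strength(trust: float, rules_a: set, rules_b: set) -> float:
    union = rules_a | rules_b
    similarity = len(rules_a & rules_b) / len(union) if union else 1.0
    return trust * (0.5 + 0.5 * similarity)   # value in [0.5 * trust, trust]

# Example: two agents sharing 2 of 4 rules, with trust 0.8 in the peer agent.
print(argument_strength(0.8, {"r1", "r2", "r3"}, {"r2", "r3", "r4"}))  # 0.6
```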
{"title":"Argumentation-based multi-agent distributed reasoning in dynamic and open environments","authors":"Helio Monte-Alto, Mariela Morveli-Espinoza, Cesar Tacla","doi":"10.1007/s10115-024-02101-x","DOIUrl":"https://doi.org/10.1007/s10115-024-02101-x","url":null,"abstract":"<p>This work presents an approach for distributed and contextualized reasoning in multi-agent systems, considering environments in which agents may have incomplete, uncertain and inconsistent knowledge. Knowledge is represented by defeasible logic with mapping rules, which model the capability of agents to acquire knowledge from other agents during reasoning. Based on such knowledge representation, an argumentation-based reasoning model that enables distributed building of reusable argument structures to support conclusions is proposed. Conflicts between arguments are resolved by an argument strength calculation that considers the trust among agents and the degree of similarity between knowledge of different agents, based on the intuition that greater similarity between knowledge defined by different agents implies in less uncertainty about the validity of the built argument. Contextualized reasoning is supported through sharing of relevant knowledge by an agent when issuing queries to other agents, which enable the cooperating agents to be aware of knowledge not known a priori but that is important to reach a reasonable conclusion given the context of the agent that issued the query. A distributed algorithm is presented and analytically and experimentally evaluated asserting its computational feasibility. Finally, our approach is compared to related work, highlighting the contributions presented, demonstrating its applicability in a broader range of scenarios, and presenting perspectives for future work.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"82 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph neural architecture search with heterogeneous message-passing mechanisms
Pub Date: 2024-04-12 | DOI: 10.1007/s10115-024-02090-x
Yili Wang, Jiamin Chen, Qiutong Li, Changlong He, Jianliang Gao
In recent years, neural architecture search has been utilized to design effective heterogeneous graph neural networks (HGNNs) and has achieved remarkable performance beyond manually designed networks. Generally, there are two mainstream design manners in heterogeneous graph neural architecture search (HGNAS). One is to automatically design a meta-graph to guide the direction of message-passing in a heterogeneous graph, thereby obtaining semantic information. The other learns to design convolutional operators that enhance message-extraction capabilities for handling the diverse information in a heterogeneous graph. Through experiments, we observe a strong interdependence between message-passing direction and message extraction, which has a significant impact on the performance of HGNNs. However, previous HGNAS methods focus on one-sided design and lack the ability to capture this interdependence. To address this issue, we propose a novel perspective called the heterogeneous message-passing mechanism for HGNAS, which enables HGNAS to effectively capture the interdependence between message-passing direction and message extraction and thus automatically design better-performing HGNNs. We call our method heterogeneous message-passing mechanisms search (HMMS). Extensive experiments on two popular tasks show that our method designs powerful HGNNs that achieve SOTA results on different benchmark datasets. Code is available at https://github.com/HetGNAS/HMMS.
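A minimal sketch of one heterogeneous message-passing step, separating the two design choices the abstract couples: which typed edges messages follow (direction) and how messages are transformed (extraction). The tiny graph, feature sizes and mean aggregator are illustrative assumptions; HMMS searches over such choices rather than fixing them by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = {"author": rng.normal(size=(3, 4)), "paper": rng.normal(size=(2, 4))}
# Typed edge lists: (source node id, target node id) per relation.
edges = {("author", "writes", "paper"): [(0, 0), (1, 0), (2, 1)]}
# One transform per relation plays the role of the "message extraction" operator.
W = {("author", "writes", "paper"): rng.normal(size=(4, 4))}

def hetero_step(feats, edges, W):
    out = {k: v.copy() for k, v in feats.items()}
    for (src_t, rel, dst_t), pairs in edges.items():
        msgs = {}
        for s, d in pairs:                       # follow this relation's direction
            msgs.setdefault(d, []).append(feats[src_t][s] @ W[(src_t, rel, dst_t)])
        for d, m in msgs.items():                # mean-aggregate incoming messages
            out[dst_t][d] += np.mean(m, axis=0)
    return out

print(hetero_step(feats, edges, W)["paper"].shape)  # (2, 4)
```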
{"title":"Graph neural architecture search with heterogeneous message-passing mechanisms","authors":"Yili Wang, Jiamin Chen, Qiutong Li, Changlong He, Jianliang Gao","doi":"10.1007/s10115-024-02090-x","DOIUrl":"https://doi.org/10.1007/s10115-024-02090-x","url":null,"abstract":"<p>In recent years, neural network search has been utilized in designing effective heterogeneous graph neural networks (HGNN) and has achieved remarkable performance beyond manually designed networks. Generally, there are two mainstream design manners in heterogeneous graph neural architecture search (HGNAS). The one is to automatically design a meta-graph to guide the direction of message-passing in a heterogeneous graph, thereby obtaining semantic information. The other learns to design the convolutional operator aiming to enhance message extraction capabilities to handle the diverse information in a heterogeneous graph. Through experiments, we observe a strong interdependence between message-passing direction and message extraction, which has a significant impact on the performance of HGNNs. However, previous HGNAS methods focus on one-sided design and lacked the ability to capture this interdependence. To address the issue, we propose a novel perspective called heterogeneous message-passing mechanism for HGNAS, which enables HGNAS to effectively capture the interdependence between message-passing direction and message extraction for designing HGNNs with better performance automatically. We call our method heterogeneous message-passing mechanisms search (HMMS). Extensive experiments on two popular tasks show that our method designs powerful HGNNs that have achieved SOTA results in different benchmark datasets. Codes are available at https://github.com/HetGNAS/HMMS.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive semi-supervised learning from stronger augmentation transformations of discrete text information
Pub Date: 2024-04-11 | DOI: 10.1007/s10115-024-02100-y
Xuemiao Zhang, Zhouxing Tan, Fengyu Lu, Rui Yan, Junfei Liu
Semi-supervised learning is a promising approach to the problem of insufficient labeled data. Recent methods, grouped into the paradigms of consistency regularization and pseudo-labeling, perform outstandingly on image data but achieve limited improvements when applied to textual information, because they neglect the discrete nature of text and lack high-quality text augmentation transformations. In this paper, we propose the novel SeqMatch method. It automatically perceives abnormal model states caused by anomalous data produced by text augmentations, reduces their interference, and instead leverages normal examples to improve the effectiveness of consistency regularization. It also generates hard artificial pseudo-labels so that the model can be efficiently updated and optimized toward low entropy. We further design several much stronger, well-organized text augmentation pipelines to increase the divergence between the two views of unlabeled discrete text sequences, enabling the model to learn more knowledge from the alignment. Extensive comparative experiments show that SeqMatch significantly outperforms previous methods on three widely used benchmarks. In particular, SeqMatch achieves a maximum performance improvement of 16.4% over purely supervised training when provided with a minimal number of labeled examples.
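A sketch of the general consistency-regularization plus pseudo-labeling pattern that this line of work builds on: predictions on a weakly augmented view produce hard pseudo-labels, which supervise the strongly augmented view only when the model is confident. This is the generic FixMatch-style recipe under assumed placeholder inputs, not SeqMatch's own augmentation pipelines or anomaly-aware weighting.

```python
import numpy as np

def masked_pseudo_label_loss(p_weak: np.ndarray, p_strong: np.ndarray,
                             threshold: float = 0.95) -> float:
    """p_weak / p_strong: (batch, classes) predicted probabilities for the
    weakly / strongly augmented views of the same unlabeled texts."""
    conf = p_weak.max(axis=1)
    pseudo = p_weak.argmax(axis=1)                  # hard artificial labels
    mask = conf >= threshold                        # keep only confident examples
    if not mask.any():
        return 0.0
    ce = -np.log(p_strong[mask, pseudo[mask]] + 1e-12)
    return float(ce.mean())

p_weak = np.array([[0.97, 0.03], [0.60, 0.40]])
p_strong = np.array([[0.80, 0.20], [0.55, 0.45]])
print(masked_pseudo_label_loss(p_weak, p_strong))   # only the first example contributes
```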
{"title":"Adaptive semi-supervised learning from stronger augmentation transformations of discrete text information","authors":"Xuemiao Zhang, Zhouxing Tan, Fengyu Lu, Rui Yan, Junfei Liu","doi":"10.1007/s10115-024-02100-y","DOIUrl":"https://doi.org/10.1007/s10115-024-02100-y","url":null,"abstract":"<p>Semi-supervised learning is a promising approach to dealing with the problem of insufficient labeled data. Recent methods grouped into paradigms of consistency regularization and pseudo-labeling have outstanding performances on image data, but achieve limited improvements when employed for processing textual information, due to the neglect of the discrete nature of textual information and the lack of high-quality text augmentation transformation means. In this paper, we propose the novel SeqMatch method. It can automatically perceive abnormal model states caused by anomalous data obtained by text augmentations and reduce their interferences and instead leverages normal ones to improve the effectiveness of consistency regularization. And it generates hard artificial pseudo-labels to enable the model to be efficiently updated and optimized toward low entropy. We also design several much stronger well-organized text augmentation transformation pipelines to increase the divergence between two views of unlabeled discrete textual sequences, thus enabling the model to learn more knowledge from the alignment. Extensive comparative experimental results show that our SeqMatch outperforms previous methods on three widely used benchmarks significantly. In particular, SeqMatch can achieve a maximum performance improvement of 16.4% compared to purely supervised training when provided with a minimal number of labeled examples.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"35 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep graph clustering via mutual information maximization and mixture model
Pub Date: 2024-04-10 | DOI: 10.1007/s10115-024-02097-4
Maedeh Ahmadi, Mehran Safayani, Abdolreza Mirzaei
Attributed graph clustering, or community detection, which learns to cluster the nodes of a graph, is a challenging task in graph analysis. Recently, contrastive learning has shown significant results in various unsupervised graph learning tasks. Despite the success of graph contrastive learning methods in self-supervised graph learning, their use for graph clustering is not well explored. In this paper, we introduce a contrastive learning framework for learning clustering-friendly node embeddings. We propose Gaussian mixture information maximization, which utilizes a mutual information maximization approach for node embedding. Meanwhile, in order to obtain a clustering-friendly embedding space, it imposes a mixture-of-Gaussians distribution on this space. The parameters of the contrastive node embedding model and of the mixture distribution are optimized jointly in a unified framework. Experiments show that our clustering-directed embedding space enhances clustering performance compared with the case where the community structure of the graph is ignored during node representation learning. Results on real-world datasets demonstrate the effectiveness of our method in community detection.
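A minimal sketch of the clustering-friendly embedding idea: once an encoder produces node embeddings, a mixture of Gaussians is fitted in that space and community assignments are read off the mixture components. The random "embeddings" below are stand-ins, and the two-stage fit shown here is an assumption; the paper optimizes the contrastive encoder and the mixture jointly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend embeddings for 60 nodes drawn from two latent communities.
emb = np.vstack([rng.normal(0.0, 0.3, size=(30, 16)),
                 rng.normal(2.0, 0.3, size=(30, 16))])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
communities = gmm.fit_predict(emb)   # soft posteriors available via predict_proba
print(np.bincount(communities))      # roughly 30 nodes per community
```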
{"title":"Deep graph clustering via mutual information maximization and mixture model","authors":"Maedeh Ahmadi, Mehran Safayani, Abdolreza Mirzaei","doi":"10.1007/s10115-024-02097-4","DOIUrl":"https://doi.org/10.1007/s10115-024-02097-4","url":null,"abstract":"<p>Attributed graph clustering or community detection which learns to cluster the nodes of a graph is a challenging task in graph analysis. Recently contrastive learning has shown significant results in various unsupervised graph learning tasks. In spite of the success of graph contrastive learning methods in self-supervised graph learning, using them for graph clustering is not well explored. In this paper, we introduce a contrastive learning framework for learning clustering-friendly node embedding. We propose Gaussian mixture information maximization which utilizes a mutual information maximization approach for node embedding. Meanwhile, in order to have a clustering-friendly embedding space, it imposes a mixture of Gaussians distribution on this space. The parameters of the contrastive node embedding model and the mixture distribution are optimized jointly in a unified framework. Experiments show that our clustering-directed embedding space can enhance clustering performance in comparison with the case where community structure of the graph is ignored during node representation learning. The results on real-world datasets demonstrate the effectiveness of our method in community detection.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"55 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees
Pub Date: 2024-04-09 | DOI: 10.1007/s10115-024-02089-4
Vítor Bezerra Silva, Dimas Cassimiro Nascimento
Multi-Attribute Similarity Join represents an important task for a variety of applications. Due to the large amounts of data involved, several techniques have been proposed to avoid superfluous comparisons between entities. One of these techniques is the Index Tree. In this work, we propose an adaptive version (Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration to construct the Adaptive Index Tree. We also propose a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency for the Similarity Join task. Finally, we propose Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we employ five real-world datasets in the experimental analysis. Based on the experiments, we conclude that our reduced approaches produce superior results when compared to the state-of-the-art approach, especially when dealing with datasets that present a significant number of attributes and/or large attribute sizes.
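A toy illustration of why index-based filtering helps a similarity join: candidate pairs are generated only within small blocks instead of comparing all pairs. The single-character blocking key and the tiny record list below are illustrative assumptions that only convey the pruning idea; the paper's Adaptive and Reduced Index Trees organize such filters per attribute in a tree and pick the configuration automatically.

```python
from collections import defaultdict
from itertools import combinations

records = [("john smith", "new york"), ("jon smith", "new york"),
           ("mary jones", "boston"), ("marie jones", "boston")]

def candidate_pairs(records, attr: int):
    """Group records by a cheap key on one attribute and pair only within blocks."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[rec[attr][:1]].append(i)          # block key: first character
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs

# Intersecting candidates from two attributes prunes the pair space further.
print(candidate_pairs(records, 0) & candidate_pairs(records, 1))  # {(0, 1), (2, 3)}
```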
{"title":"Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees","authors":"Vítor Bezerra Silva, Dimas Cassimiro Nascimento","doi":"10.1007/s10115-024-02089-4","DOIUrl":"https://doi.org/10.1007/s10115-024-02089-4","url":null,"abstract":"<p>Multi-Attribute Similarity Join represents an important task for a variety of applications. Due to a large amount of data, several techniques and approaches were proposed to avoid superfluous comparisons between entities. One of these techniques is denominated Index Tree. In this work, we proposed an adaptive version (Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration to construct the Adaptive Index Tree. We also proposed a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency for the Similarity Join task. Finally, we proposed Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we employed five real-world datasets to perform the experimental analysis. Based on the experiments, we conclude that our reduced approaches have produced superior results when compared to the state-of-the-art approach, specially when dealing with datasets that present a significant number of attributes and/or and expressive attribute sizes.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"45 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Noise-free sampling with majority framework for an imbalanced classification problem
Pub Date: 2024-04-09 | DOI: 10.1007/s10115-024-02079-6
Neni Alya Firdausanti, Israel Mendonça, Masayoshi Aritsugi
Class imbalance has been widely accepted as a significant factor that negatively impacts a machine learning classifier's performance. One technique to avoid this problem is to balance the data distribution with sampling-based approaches, in which synthetic data are generated using the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, and the boundaries between the majority class and the minority class become blurred. Such phenomena shift the algorithm's decision boundary away from the ideal outcome. In this work, we propose a hybrid framework with two primary objectives. The first is to address class distribution imbalance by synthetically increasing the data of the minority class, and the second is to devise an efficient noise reduction technique that improves the class balance algorithm. The proposed framework focuses on removing noisy elements from the majority class and, by doing so, provides more accurate information to the subsequent synthetic data generator. To evaluate the effectiveness of our framework, we employ the geometric mean (G-mean) as the evaluation metric. The experimental results show that our framework improves the prediction G-mean for eight classifiers across eleven datasets. The improvements range from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.
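A short worked example of the evaluation metric used here: the G-mean is the geometric mean of per-class recalls, so it collapses toward zero whenever a classifier ignores one class, which makes it well suited to imbalanced problems. The labels below are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # 80/20 imbalance
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

recalls = recall_score(y_true, y_pred, average=None)    # one recall per class
g_mean = float(np.prod(recalls) ** (1.0 / len(recalls)))
print(recalls, g_mean)   # [0.875, 0.5] -> G-mean ~ 0.661
```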
{"title":"Noise-free sampling with majority framework for an imbalanced classification problem","authors":"Neni Alya Firdausanti, Israel Mendonça, Masayoshi Aritsugi","doi":"10.1007/s10115-024-02079-6","DOIUrl":"https://doi.org/10.1007/s10115-024-02079-6","url":null,"abstract":"<p>Class imbalance has been widely accepted as a significant factor that negatively impacts a machine learning classifier’s performance. One of the techniques to avoid this problem is to balance the data distribution by using sampling-based approaches, in which synthetic data is generated using the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, and the boundaries between the majority class and the minority class are blurred. Such phenomena shift the algorithm’s decision boundary away from the ideal outcome. In this work, we propose a hybrid framework for two primary objectives. The first objective is to address class distribution imbalance by synthetically increasing the data of a minority class, and the second objective is, to devise an efficient noise reduction technique that improves the class balance algorithm. The proposed framework focuses on removing noisy elements from the majority class, and by doing so, provides more accurate information to the subsequent synthetic data generator algorithm. To evaluate the effectiveness of our framework, we employ the geometric mean (<i>G</i>-mean) as the evaluation metric. The experimental results show that our framework is capable of improving the prediction <i>G</i>-mean for eight classifiers across eleven datasets. The range of improvements varies from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"55 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing knowledge discovery and management through intelligent computing methods: a decisive investigation
Pub Date: 2024-04-09 | DOI: 10.1007/s10115-024-02099-2
Rayees Ahamad, Kamta Nath Mishra
Knowledge Discovery and Management (KDM) encompasses a comprehensive process and approach involving the creation, discovery, capture, organization, refinement, presentation, and provision of data, information, and knowledge with a specific goal in mind. At the core, Knowledge Management and Artificial Intelligence (AI) revolve around knowledge itself. AI serves as the mechanism enabling machines to obtain, acquire, process, and utilize information, thereby executing tasks and uncovering knowledge that can be shared with people to enhance strategic decision-making. While conventional methods play a role in the KDM process, incorporating intelligent approaches can further enhance efficiency in terms of time and accuracy. Intelligent techniques, particularly soft computing approaches, possess the ability to learn in any environment by leveraging logic, reasoning, and other computational capabilities. These techniques can be broadly categorized into Learning algorithms (Supervised, Unsupervised, and Reinforcement), Logic and Rule-Based algorithms (Fuzzy Logic, Bayesian Network, and CBR-RBR), Nature-inspired algorithms (Genetic algorithm, Particle Swarm Optimization, and Ant Colony Optimization), and hybrid approaches that combine these algorithms. The primary objective of these intelligent techniques is to address the day-to-day challenges faced by rural and smart digital societies. In this study, the authors extensively investigated various intelligent computing methods (ICMs) specifically relevant to distinct problems, providing accurate and reasonable knowledge-based solutions. The application of both single ICMs and combined ICMs was explored to solve domain-specific problems, and their effectiveness was analyzed and discussed. The results indicated that combined ICMs exhibited superior efficiency compared to single ICMs. Furthermore, the authors conducted an analysis and comparison of ICMs based on their application domain, parameters, methods/algorithms, efficiency, and acceptable outcomes. Additionally, the authors identified several problem scenarios that can be effectively resolved using intelligent techniques.
{"title":"Enhancing knowledge discovery and management through intelligent computing methods: a decisive investigation","authors":"Rayees Ahamad, Kamta Nath Mishra","doi":"10.1007/s10115-024-02099-2","DOIUrl":"https://doi.org/10.1007/s10115-024-02099-2","url":null,"abstract":"<p>Knowledge Discovery and Management (KDM) encompasses a comprehensive process and approach involving the creation, discovery, capture, organization, refinement, presentation, and provision of data, information, and knowledge with a specific goal in mind. At the core, Knowledge Management and Artificial Intelligence (AI) revolve around knowledge itself. AI serves as the mechanism enabling machines to obtain, acquire, process, and utilize information, thereby executing tasks and uncovering knowledge that can be shared with people to enhance strategic decision-making. While conventional methods play a role in the KDM process, incorporating intelligent approaches can further enhance efficiency in terms of time and accuracy. Intelligent techniques, particularly soft computing approaches, possess the ability to learn in any environment by leveraging logic, reasoning, and other computational capabilities. These techniques can be broadly categorized into Learning algorithms (Supervised, Unsupervised, and Reinforcement), Logic and Rule-Based algorithms (Fuzzy Logic, Bayesian Network, and CBR-RBR), Nature-inspired algorithms (Genetic algorithm, Particle Swarm Optimization, and Ant Colony Optimization), and hybrid approaches that combine these algorithms. The primary objective of these intelligent techniques is to address the day-to-day challenges faced by rural and smart digital societies. In this study, the authors extensively investigated various intelligent computing methods (ICMs) specifically relevant to distinct problems, providing accurate and reasonable knowledge-based solutions. The application of both single ICMs and combined ICMs was explored to solve domain-specific problems, and their effectiveness was analyzed and discussed. The results indicated that combined ICMs exhibited superior efficiency compared to single ICMs. Furthermore, the authors conducted an analysis and comparison of ICMs based on their application domain, parameters, methods/algorithms, efficiency, and acceptable outcomes. Additionally, the authors identified several problem scenarios that can be effectively resolved using intelligent techniques.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"2011 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Rényi-type quasimetric with random interference detection
Pub Date: 2024-04-09 | DOI: 10.1007/s10115-024-02078-7
Roy Cerqueti, Mario Maggi
This paper introduces a new dissimilarity measure between two discrete and finite probability distributions. The approach is grounded jointly in mixtures of probability distributions and an optimization procedure. We discuss the clear interpretation of the constitutive elements of the measure from an information-theoretical perspective, also highlighting its connections with the Rényi divergence of infinite order. Moreover, we show how the measure describes the inefficiency of assuming that a given probability distribution coincides with a benchmark one, by giving a formal expression for the random interference between the considered probability distributions. We explore the properties of the considered tool, which are in line with those defining the concept of a quasimetric, i.e., a divergence for which the triangle inequality is satisfied. As a possible usage of the introduced device, an application to rare events is illustrated. This application shows that our measure may be suitable in cases where the accuracy of small probabilities is a relevant matter.
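For reference, the Rényi divergence of infinite order that the abstract connects the new measure to has, for discrete finite distributions, the standard closed form shown below; the paper's own quasimetric is defined separately via mixtures and an optimization step, so this is only the benchmark quantity, not the proposed measure.

```latex
% R\'enyi divergence of infinite order between discrete, finite distributions
% P=(p_1,\dots,p_n) and Q=(q_1,\dots,q_n):
\[
  D_{\infty}(P \,\|\, Q) \;=\; \log \max_{i:\, p_i > 0} \frac{p_i}{q_i}.
\]
```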
{"title":"A Rényi-type quasimetric with random interference detection","authors":"Roy Cerqueti, Mario Maggi","doi":"10.1007/s10115-024-02078-7","DOIUrl":"https://doi.org/10.1007/s10115-024-02078-7","url":null,"abstract":"<p>This paper introduces a new dissimilarity measure between two discrete and finite probability distributions. The followed approach is grounded jointly on mixtures of probability distributions and an optimization procedure. We discuss the clear interpretation of the constitutive elements of the measure under an information-theoretical perspective by also highlighting its connections with the Rényi divergence of infinite order. Moreover, we show how the measure describes the inefficiency in assuming that a given probability distribution coincides with a benchmark one by giving formal writing of the <i>random interference</i> between the considered probability distributions. We explore the properties of the considered tool, which are in line with those defining the concept of quasimetric—i.e. a divergence for which the triangular inequality is satisfied. As a possible usage of the introduced device, an application to rare events is illustrated. This application shows that our measure may be suitable in cases where the accuracy of the small probabilities is a relevant matter.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"50 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient parameter learning for Bayesian Network classifiers following the Apache Spark Dataframes paradigm
Pub Date: 2024-04-08 | DOI: 10.1007/s10115-024-02096-5
Ioannis Akarepis, Agorakis Bompotas, Christos Makris
Every year the volume of information grows at a high rate; therefore, more modern approaches are required to deal with it efficiently. Distributed systems, such as Apache Spark, offer such an approach, and as a result more and more machine learning models are being adapted to distributed logic. In this paper, we propose a classification model, based on Bayesian Networks (BNs), that utilizes the distributed environment of Apache Spark through the Dataframes paradigm. This model can exploit any user-provided directed acyclic graph (DAG) that portrays the dependencies between the features of a dataset, estimating the parameters of the conditional probability distributions associated with each node in the graph in order to make accurate predictions. Moreover, in contrast to the majority of implementations, which can only handle discrete features, it is also capable of efficiently handling continuous features by calculating the Gaussian probability density function.
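A minimal sketch of how Gaussian parameters for a continuous feature, conditioned on its parent in the DAG, can be estimated with Spark Dataframe aggregations, in the spirit the abstract describes. The column names and the tiny in-memory dataset are illustrative assumptions; the paper's model handles arbitrary user-provided DAGs rather than this single parent-child pair.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("bn-sketch").getOrCreate()
df = spark.createDataFrame(
    [("yes", 5.1), ("yes", 4.9), ("no", 7.0), ("no", 6.4)],
    ["parent_class", "feature_x"],
)

# Per parent value: mean and standard deviation define the conditional
# Gaussian N(mu, sigma^2) used to score feature_x at prediction time.
params = (df.groupBy("parent_class")
            .agg(F.mean("feature_x").alias("mu"),
                 F.stddev("feature_x").alias("sigma")))
params.show()
```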
{"title":"Efficient parameter learning for Bayesian Network classifiers following the Apache Spark Dataframes paradigm","authors":"Ioannis Akarepis, Agorakis Bompotas, Christos Makris","doi":"10.1007/s10115-024-02096-5","DOIUrl":"https://doi.org/10.1007/s10115-024-02096-5","url":null,"abstract":"<p>Every year the volume of information is growing at a high rate; therefore, more modern approaches are required to deal with such issues efficiently. Distributed systems, such as Apache Spark, offer such a modern approach, resulting in more and more machine learning models, being adapted into using distributed logic. In this paper, we propose a classification model, based on Bayesian Networks (BNs), that utilizes the distributed environment of Apache Spark using the Dataframes paradigm. This model can exploit any user-provided directed acyclic graph (DAG) that portrays the dependencies between the features of a dataset to estimate the parameters of the conditional probability distributions associated with each node in the graph to make accurate predictions. Moreover, in contrast with the majority of implementations that are only able to handle discrete features, it is also capable of efficiently handling continuous features by calculating the Gaussian probability density function.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"37 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}