Identifying fake information (misinformation) on social media has become a cardinal task because misinformation significantly harms governments and the public, and many spam bots maliciously retweet it. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A Bayesian graph Local extrema Convolution (BLC) is first proposed to aggregate node features in the graph structure. BLC accounts for unreliable relationships and uncertainties in the propagation structure and emphasizes the attribute differences between a node and its neighbors. Then, a new long-tail strategy that matches long-tail users with the global social network is advocated to keep graph neural networks from over-concentrating on high-degree nodes. Finally, the proposed model is evaluated experimentally on two public Twitter datasets; the results demonstrate that the long-tail strategy significantly improves the effectiveness of existing graph-based methods at detecting misinformation. The robustness of BLC is also examined on three graph datasets, where it consistently outperforms traditional algorithms when 15% of a dataset is perturbed.
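The abstract does not specify the BLC layer itself; as a rough, hedged illustration of the "local extrema" idea it builds on (aggregating the *differences* between a node's attributes and those of its neighbors, so that nodes standing out from their neighborhood are emphasized), a minimal sketch follows. The function name `local_extrema_conv`, the adjacency-list input, and the subtraction-based update are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def local_extrema_conv(features, adj, w_self, w_diff):
    """One aggregation step emphasizing node-neighbor attribute
    differences (hypothetical sketch of the 'local extrema' idea).

    features: (n, d) node attribute matrix
    adj: dict mapping node index -> list of neighbor indices
    w_self, w_diff: (d, d) weight matrices
    """
    n, d = features.shape
    out = np.zeros_like(features)
    for i in range(n):
        neighbors = adj.get(i, [])
        # Aggregate differences h_i - h_j rather than raw neighbor features,
        # so "local extrema" nodes receive large activations.
        diff = sum(features[i] - features[j] for j in neighbors) if neighbors else np.zeros(d)
        out[i] = features[i] @ w_self + diff @ w_diff
    return np.maximum(out, 0.0)  # ReLU

# Toy usage: a 3-node path graph with 2-dimensional attributes.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = {0: [1], 1: [0, 2], 2: [1]}
rng = np.random.default_rng(0)
h = local_extrema_conv(feats, adj, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
```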
{"title":"Bayesian Graph Local Extrema Convolution with Long-Tail Strategy for Misinformation Detection","authors":"Guixian Zhang, Shichao Zhang, Guan Yuan","doi":"10.1145/3639408","DOIUrl":"https://doi.org/10.1145/3639408","url":null,"abstract":"<p>It has become a cardinal task to identify fake information (misinformation) on social media because it has significantly harmed the government and the public. There are many spam bots maliciously retweeting misinformation. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A <b>B</b>ayesian graph <b>L</b>ocal extrema <b>C</b>onvolution (BLC) is first proposed to aggregate node features in the graph structure. The BLC approach considers unreliable relationships and uncertainties in the propagation structure, and the differences between nodes and neighboring nodes are emphasized in the attributes. Then, a new long-tail strategy for matching long-tail users with the global social network is advocated to avoid over-concentration on high-degree nodes in graph neural networks. Finally, the proposed model is experimentally evaluated with two publicly Twitter datasets and demonstrates that the proposed long-tail strategy significantly improves the effectiveness of existing graph-based methods in terms of detecting misinformation. The robustness of BLC has also been examined on three graph datasets and demonstrates that it consistently outperforms traditional algorithms when perturbed by 15% of a dataset.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"12 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie
Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs, which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs with limited overlap in weighted hypergraphs has not, to the best of our knowledge, been studied in a principled way. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most k subgraphs with maximum total aggregate density while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of the weight of the intersection to the weight of the union of the node sets of two subgraphs. After showing that this problem is NP-hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as an efficient practical heuristic. Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.
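For concreteness, the pairwise weighted Jaccard coefficient used as the overlap bound can be computed as below. This is a minimal sketch assuming node weights are supplied as a dict; it is illustrative, not code from the paper.

```python
def weighted_jaccard(nodes_a, nodes_b, weight):
    """Weighted Jaccard coefficient of two node sets: the total weight
    of the intersection divided by the total weight of the union."""
    inter = sum(weight[v] for v in nodes_a & nodes_b)
    union = sum(weight[v] for v in nodes_a | nodes_b)
    return inter / union if union > 0 else 0.0

# Example: two subgraphs sharing node 2.
w = {1: 2.0, 2: 1.0, 3: 3.0, 4: 0.5}
print(weighted_jaccard({1, 2}, {2, 3, 4}, w))  # 1.0 / 6.5
```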
{"title":"Finding Subgraphs with Maximum Total Density and Limited Overlap in Weighted Hypergraphs","authors":"Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie","doi":"10.1145/3639410","DOIUrl":"https://doi.org/10.1145/3639410","url":null,"abstract":"<p>Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs in weighted hypergraphs with limited overlap has not been studied in a principled way, to the best of our knowledge. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most <i>k</i> subgraphs with maximum total aggregate density, while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of weights of intersection divided by weights of union on two nodes sets of the subgraphs. After showing that such a problem is NP-Hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as, an efficient practical heuristic. Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"21 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinpeng Li, Hang Yu, Zhenyu Zhang, Xiangfeng Luo, Shaorong Xie
Concept drift is a phenomenon where the distribution of a data stream changes over time. When this happens, model predictions become less accurate, so models built in the past need to be re-learned for the current data. Two design questions need to be addressed when designing a re-learning strategy: which type of concept drift has occurred, and how can the drift type be used to improve re-learning performance? Existing drift detection methods are often good at determining when drift has occurred, but few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the drift type on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, and then uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal drift point via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, at which to re-learn the new model. Experiments validate Type-LDA on both synthetic and real-world data, and the results show that accurately identifying the drift type improves adaptation accuracy.
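The abstract does not give the identifier's training setup; as a hedged sketch of what "pre-training on synthetic data of known drift types" could look like, the snippet below generates labeled one-dimensional streams with sudden versus gradual mean shifts. The two drift types, the stream generator, and all names are illustrative assumptions, not Type-LDA's actual pipeline.

```python
import numpy as np

def synthetic_stream(drift_type, length=500, drift_at=250, rng=None):
    """Generate a 1-D stream whose mean shifts from 0 to 2 at drift_at,
    either abruptly ('sudden') or over a linear ramp ('gradual')."""
    rng = rng or np.random.default_rng()
    x = rng.normal(0.0, 1.0, length)
    if drift_type == "sudden":
        x[drift_at:] += 2.0
    elif drift_type == "gradual":
        x[drift_at:] += np.linspace(0.0, 2.0, length - drift_at)
    return x

# Labeled corpus for pre-training a drift-type identifier.
rng = np.random.default_rng(42)
streams = [(synthetic_stream(t, rng=rng), t)
           for t in ("sudden", "gradual") for _ in range(100)]
```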
{"title":"Concept Drift Adaptation by Exploiting Drift Type","authors":"Jinpeng Li, Hang Yu, zhenyuzhang, Xiangfeng Luo, Shaorong Xie","doi":"10.1145/3638777","DOIUrl":"https://doi.org/10.1145/3638777","url":null,"abstract":"<p>Concept drift is a phenomenon where the distribution of data streams changes over time. When this happens, model predictions become less accurate. Hence, models built in the past need to be re-learned for the current data. Two design questions need to be addressed in designing a strategy to re-learn models: which type of concept drift has occurred, and how to utilize the drift type to improve re-learning performance. Existing drift detection methods are often good at determining when drift has occurred. However, few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the type of drift on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, then it uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal point of drift via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, to re-learn the new model. Experiments validate Type-LDA on both synthetic data and real-world data, and the results show that accurately identifying drift type can improve adaptation accuracy.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139373231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu
Recommendation systems powered by AI are widely used to improve user experience, but they inevitably raise privacy leakage and other security issues because they rely on extensive user data. Addressing these challenges can protect users' personal information, benefit service providers, and foster service ecosystems. Numerous techniques based on differential privacy have been proposed to solve this problem; however, existing solutions suffer from inadequate data utilization and a tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy while protecting users' private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of the Laplace distribution. We first introduce the recommendation method adopted in ID-SR, which is based on matrix factorization with a newly designed social regularization term that improves recommendation effectiveness. We then propose a differential privacy preserving scheme tailored to this method that leverages the characteristics of the Laplace distribution to safeguard user data. Theoretical analysis and experimental evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.
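The infinite divisibility the scheme builds on is a standard property: a Laplace(0, b) variable equals in distribution the sum of n i.i.d. differences of Gamma(1/n, b) variables, which lets Laplace noise be assembled from many small, independently generated pieces. The check below is a generic illustration of that fact, not the ID-SR noise pipeline.

```python
import numpy as np

def divided_laplace(b, n, size, rng):
    """Sample Laplace(0, b) noise as a sum of n i.i.d. gamma differences:
    Laplace(0, b) =d sum_{k=1}^{n} (G_k - G'_k), with G_k, G'_k ~ Gamma(1/n, b)."""
    g1 = rng.gamma(shape=1.0 / n, scale=b, size=(n, size)).sum(axis=0)
    g2 = rng.gamma(shape=1.0 / n, scale=b, size=(n, size)).sum(axis=0)
    return g1 - g2

rng = np.random.default_rng(0)
pieces = divided_laplace(b=1.0, n=10, size=100_000, rng=rng)
direct = rng.laplace(0.0, 1.0, 100_000)
# Both samples have the Laplace(0, 1) distribution (variance ~ 2b^2 = 2).
print(pieces.var(), direct.var())
```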
{"title":"ID-SR: Privacy-Preserving Social Recommendation based on Infinite Divisibility for Trustworthy AI","authors":"Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu","doi":"10.1145/3639412","DOIUrl":"https://doi.org/10.1145/3639412","url":null,"abstract":"<p>Recommendation systems powered by AI are widely used to improve user experience. However, it inevitably raises privacy leakage and other security issues due to the utilization of extensive user data. Addressing these challenges can protect users’ personal information, benefit service providers, and foster service ecosystems. Presently, numerous techniques based on differential privacy have been proposed to solve this problem. However, existing solutions encounter issues such as inadequate data utilization and an tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy and protect users’ private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of Laplace distribution. We first introduce a novel recommendation method adopted in ID-SR, which is established based on matrix factorization with a newly designed social regularization term for improving recommendation effectiveness. Additionally, we propose a differential privacy preserving scheme tailored to the above method that leverages the Laplace distribution’s characteristics to safeguard user data. Theoretical analysis and experimentation evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"26 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng
Data perturbation under a differential privacy constraint is an important approach to protecting data privacy. However, as the data dimensionality increases, the privacy budget allocated to each dimension decreases and the amount of added noise grows, which eventually lowers the utility of the data in training tasks. To protect the privacy of training data while enhancing data utility, we propose a Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA comprises three procedures: quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance, based on information entropy and attribute correlation, provides the arithmetic basis for attribute partition and budget allocation. During attribute partition, all attributes of the training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During budget allocation, a γ-privacy model is proposed to balance data privacy and data utility, providing the privacy constraint that guides budget allocation. Three comprehensive real-world datasets are used to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme achieves a tradeoff between privacy and utility.
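The abstract only names the three procedures; the sketch below illustrates one plausible reading of the first two, scoring attributes by Shannon entropy, splitting them into high/low classes at the median, and then dividing the privacy budget between the classes. The split rule, the 1:2 allocation ratio, and all names are assumptions, not UPPPA's actual design.

```python
import numpy as np
from collections import Counter

def shannon_entropy(column):
    """Empirical Shannon entropy (in bits) of a discrete attribute."""
    counts = np.array(list(Counter(column).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def partition_and_allocate(data, epsilon):
    """Partition attributes into high/low entropy classes at the median
    entropy, then split the total budget between the classes
    (hypothetical 1:2 ratio) and evenly within each class."""
    entropies = {a: shannon_entropy(col) for a, col in data.items()}
    median = np.median(list(entropies.values()))
    high = [a for a, h in entropies.items() if h >= median]
    low = [a for a, h in entropies.items() if h < median]
    budget = {}
    for attrs, share in ((high, epsilon / 3), (low, 2 * epsilon / 3)):
        for a in attrs:
            budget[a] = share / max(len(attrs), 1)
    return budget

data = {"age": [30, 40, 30, 50], "zip": [10001, 10002, 10003, 10004]}
print(partition_and_allocate(data, epsilon=1.0))
```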
{"title":"Utility-aware Privacy Perturbation for Training Data","authors":"Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng","doi":"10.1145/3639411","DOIUrl":"https://doi.org/10.1145/3639411","url":null,"abstract":"<p>Data perturbation under differential privacy constraint is an important approach of protecting data privacy. However, as the data dimensions increase, the privacy budget allocated to each dimension decreases and thus the amount of noise added increases, which eventually leads to lower data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose an Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA includes three procedures, the quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance based on information entropy and attribute correlation provide an arithmetic basis for attribute partition and budget allocation. During the attribute partition, all attributes of training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During the budget allocation, a <i>γ</i>-privacy model is proposed to balance data privacy and data utility so as to provide privacy constraint and guide budget allocation. Three comprehensive sets of real-world data are applied to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme can achieve the tradeoff between privacy and utility.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"2 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topic modelling is a beneficial technique used to discover latent topics in text collections, but semantics are important for correctly understanding the text content and generating a meaningful topic list. By ignoring semantics, that is, by not attempting to grasp the meaning of the words, most existing topic modelling approaches can generate meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA that captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships by matching words from the input text collection with concepts from an ontology, without using pre-calculated values from the ontology that quantify the relationships between words and concepts. Such pre-calculated values may not reflect the actual relationships between words and concepts in the input collection because they are derived from the datasets used to build the ontology rather than from the input collection itself; quantifying the relationships based on the word distribution in the input collection is more realistic and beneficial for the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there is no matching concept in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model that calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.
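As a toy illustration of quantifying concept-word relationships from the input collection's own word distribution (rather than from ontology-supplied scores), the sketch below weights each matched word by its corpus frequency. The matching rule, the normalization, and all names are assumptions, not semantic-LDA's actual method.

```python
from collections import Counter

def concept_word_weights(documents, ontology):
    """ontology: dict mapping concept -> set of associated words.
    Weight each (concept, word) pair by the word's frequency in the
    input collection, so the scores reflect this corpus rather than
    the datasets the ontology was built from."""
    freq = Counter(w for doc in documents for w in doc.split())
    weights = {}
    for concept, words in ontology.items():
        matched = {w: freq[w] for w in words if freq[w] > 0}
        total = sum(matched.values())
        weights[concept] = {w: c / total for w, c in matched.items()} if total else {}
    return weights

docs = ["the bank approved the loan", "the river bank flooded"]
onto = {"Finance": {"bank", "loan"}, "Geography": {"river", "bank"}}
print(concept_word_weights(docs, onto))
```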
{"title":"A Semantics-enhanced Topic Modelling Technique: Semantic-LDA","authors":"Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li","doi":"10.1145/3639409","DOIUrl":"https://doi.org/10.1145/3639409","url":null,"abstract":"<p>Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"11 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139376284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple-instance learning (MIL) solves the problem where training instances are grouped in bags and a binary (positive or negative) label is provided for each bag. Most existing MIL studies need fully labeled bags to train an effective classifier, yet such data can be quite hard to collect in many real-world scenarios due to the high cost of the data labeling process. Fortunately, unlike fully labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Therefore, in this paper we investigate, for the first time, MIL from only triplet comparison bags, where a triplet (Xa, Xb, Xc) carries the weak supervision information that bag Xa is more similar to Xb than to Xc. To solve this problem, we propose to train a bag-level classifier under the empirical risk minimization framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses, such as the square loss and the double hinge loss, are used. Extensive experiments validate that our proposed method significantly outperforms other baselines.
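For reference, the two convex surrogates named above are commonly defined as follows in the convex risk-minimization literature; the 1/4 normalization of the square loss is one common convention, and the MIL-specific risk itself is omitted here.

```python
import numpy as np

def double_hinge_loss(z):
    """Double hinge loss max(-z, max(0, (1 - z)/2)): convex, and linear
    with slope -1 for z <= -1, which keeps risk-rewriting formulations convex."""
    return np.maximum(-z, np.maximum(0.0, 0.5 * (1.0 - z)))

def square_loss(z):
    """Square loss (1 - z)^2 / 4, another convex binary surrogate."""
    return 0.25 * (1.0 - z) ** 2

z = np.linspace(-2, 2, 5)
print(double_hinge_loss(z), square_loss(z))
```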
{"title":"Multiple-Instance Learning from Triplet Comparison Bags","authors":"Senlin Shu, Deng-Bao Wang, Suqin Yuan, Hongxin Wei, Jiuchuan Jiang, Lei Feng, Min-Ling Zhang","doi":"10.1145/3638776","DOIUrl":"https://doi.org/10.1145/3638776","url":null,"abstract":"<p><i>Multiple-instance learning</i> (MIL) solves the problem where training instances are grouped in bags, and a binary (positive or negative) label is provided for each bag. Most of the existing MIL studies need fully labeled bags for training an effective classifier, while it could be quite hard to collect such data in many real-world scenarios, due to the high cost of data labeling process. Fortunately, unlike fully labeled data, <i>triplet comparison data</i> can be collected in a more accurate and human-friendly way. Therefore, in this paper, we for the first time investigate MIL from <i>only triplet comparison bags</i>, where a triplet (<i>X<sub>a</sub></i>, <i>X<sub>b</sub></i>, <i>X<sub>c</sub></i>) contains the weak supervision information that bag <i>X<sub>a</sub></i> is more similar to <i>X<sub>b</sub></i> than to <i>X<sub>c</sub></i>. To solve this problem, we propose to train a bag-level classifier by the <i>empirical risk minimization</i> framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses such as the square loss and the double hinge loss are used. Extensive experiments validate that our proposed method significantly outperforms other baselines.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"30 8","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mehak Khan, Gustavo B. M. Mello, Laurence Habib, Paal Engelstad, Anis Yazidi
In this paper, we present a new propagation paradigm based on the Hyperlink-Induced Topic Search (HITS) algorithm. HITS utilizes the concept of a "self-reinforcing" authority-hub relationship: the centrality of nodes is determined via repeated updates of authority and hub scores that converge to a stationary distribution. Unlike PageRank-based propagation methods, which rely solely on the idea of authorities (in-links), HITS considers the relevance of both authorities (in-links) and hubs (out-links), thereby allowing for a more informative graph learning process. To separate node prediction from propagation, we use a Multilayer Perceptron (MLP) in combination with HITS-based propagation and propose two models: HITS-GNN and HITS-GNN+. We provide additional validation of our models' efficacy through an ablation study that assesses the performance of authority-hub scores in independent models. Moreover, the effects of the main hyper-parameters and of normalization are analyzed to uncover how these techniques influence the performance of our models. Extensive experimental results indicate that the proposed approach improves baseline methods on graph (citation network) benchmark datasets by a decent margin for semi-supervised node classification, which can aid in predicting the categories (labels) of scientific articles based not only on their content but also on the types of articles they cite.
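The authority-hub update underlying the propagation is the classic HITS power iteration; a minimal, self-contained version (independent of the authors' GNN integration, and using L1 rather than the traditional L2 normalization) looks like this.

```python
import numpy as np

def hits(adj, iters=100, tol=1e-8):
    """Classic HITS power iteration on a directed adjacency matrix,
    where adj[i, j] = 1 if there is an edge i -> j.
    Returns (authority, hub) scores normalized to unit L1 norm."""
    n = adj.shape[0]
    auth = np.ones(n) / n
    hub = np.ones(n) / n
    for _ in range(iters):
        new_auth = adj.T @ hub    # good authorities are pointed to by good hubs
        new_hub = adj @ new_auth  # good hubs point to good authorities
        new_auth /= new_auth.sum()
        new_hub /= new_hub.sum()
        converged = (np.abs(new_auth - auth).sum() < tol
                     and np.abs(new_hub - hub).sum() < tol)
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

# Tiny 3-node citation graph: 0 -> 2 and 1 -> 2.
A = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 0]], dtype=float)
print(hits(A))  # node 2 is the authority; nodes 0 and 1 are hubs
```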
{"title":"HITS based Propagation Paradigm for Graph Neural Networks","authors":"Mehak Khan, Gustavo B. M. Mello, Laurence Habib, Paal Engelstad, Anis Yazidi","doi":"10.1145/3638779","DOIUrl":"https://doi.org/10.1145/3638779","url":null,"abstract":"<p>In this paper, we present a new propagation paradigm based on the principle of Hyperlink-Induced Topic Search (HITS) algorithm. The HITS algorithm utilizes the concept of a ”self-reinforcing” relationship of authority-hub. Using HITS, the centrality of nodes is determined via repeated updates of authority-hub scores that converge to a stationary distribution. Unlike PageRank-based propagation methods, which rely solely on the idea of authorities (in-links), HITS considers the relevance of both authorities (in-links) and hubs (out-links), thereby allowing for a more informative graph learning process. To segregate node prediction and propagation, we use a Multilayer Perceptron (MLP) in combination with a HITS-based propagation approach and propose two models; HITS-GNN and HITS-GNN+. We provided additional validation of our models’ efficacy by performing an ablation study to assess the performance of authority-hub in independent models. Moreover, the effect of the main hyper-parameters and normalization is also analyzed to uncover how these techniques influence the performance of our models. Extensive experimental results indicate that the proposed approach significantly improves baseline methods on the graph (citation network) benchmark datasets by a decent margin for semi-supervised node classification, which can aid in predicting the categories (labels) of scientific articles not exclusively based on their content but also based on the type of articles they cite.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"33 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network modeling has been explored extensively by means of theoretical analysis as well as numerical simulations for Network Reconstruction (NR). The network reconstruction problem requires estimating the power-law exponent (γ) of a given input network, so the effectiveness of an NR solution depends on the accuracy with which γ is calculated. In this article, we re-examine the degree-distribution-based estimation of γ, which is not very accurate due to approximations. We propose the X-distribution, which is more accurate than the degree distribution. Various state-of-the-art network models, including CPM, NRM, RefOrCite2, BA, CDPAM, and DMS, are considered for simulation purposes, and the simulation results support the proposed claim. Further, we apply the X-distribution to several real-world networks to calculate their power-law exponents, which differ from those calculated using the respective degree distributions. We observe that X-distributions exhibit more linearity (a straighter line) on the log-log scale than degree distributions, making the X-distribution more suitable for evaluating the power-law exponent via linear fitting on the log-log scale. The MATLAB implementation of the power-law exponent (γ) calculation using the X-distribution for different network models, together with the real-world datasets used in our experiments, is available here: https://github.com/Aikta-Arya/X-distribution-Retraceable-Power-Law-Exponent-of-Complex-Networks.git
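The linear-fitting step the article relies on is standard: plot the empirical distribution on a log-log scale and take the negative slope as γ. The sketch below applies it to a plain degree distribution; the X-distribution itself is defined in the article and is not reproduced here.

```python
import numpy as np
from collections import Counter

def power_law_exponent(degrees):
    """Estimate gamma by linear least squares on the log-log degree
    distribution: log P(k) ~ -gamma * log k + c."""
    counts = Counter(d for d in degrees if d > 0)
    k = np.array(sorted(counts))
    pk = np.array([counts[v] for v in k], dtype=float)
    pk /= pk.sum()
    slope, _ = np.polyfit(np.log(k), np.log(pk), 1)
    return -slope

# Toy sample from a discrete approximation of P(k) ~ k^(-2.5).
rng = np.random.default_rng(0)
sample = np.round(rng.pareto(1.5, 20_000) + 1).astype(int)
print(power_law_exponent(sample))  # roughly 2.5
```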
{"title":"X-distribution: Retraceable Power-Law Exponent of Complex Networks","authors":"Pradumn Kumar Pandey, Aikta Arya, Akrati Saxena","doi":"10.1145/3639413","DOIUrl":"https://doi.org/10.1145/3639413","url":null,"abstract":"<p>Network modeling has been explored extensively by means of theoretical analysis as well as numerical simulations for Network Reconstruction (NR). The network reconstruction problem requires the estimation of the power-law exponent (<i>γ</i>) of a given input network. Thus, the effectiveness of the NR solution depends on the accuracy of the calculation of <i>γ</i>. In this article, we re-examine the degree distribution-based estimation of <i>γ</i>, which is not very accurate due to approximations. We propose <b>X</b>-distribution, which is more accurate as compared to degree distribution. Various state-of-the-art network models, including CPM, NRM, RefOrCite2, BA, CDPAM, and DMS, are considered for simulation purposes, and simulated results support the proposed claim. Further, we apply <b>X</b>-distribution over several real-world networks to calculate their power-law exponents, which differ from those calculated using respective degree distributions. It is observed that <b>X</b>-distributions exhibit more linearity (straight line) on the log-log scale as compared to degree distributions. Thus, <b>X</b>-distribution is more suitable for the evaluation of power-law exponent using linear fitting (on the log-log scale). The MATLAB implementation of power-law exponent (<i>γ</i>) calculation using <b>X</b>-distribution for different network models, and the real-world datasets used in our experiments are available here: https://github.com/Aikta-Arya/X-distribution-Retraceable-Power-Law-Exponent-of-Complex-Networks.git\u0000</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"6 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-lingual entity alignment (CLEA) aims to find equivalent entity pairs between knowledge graphs (KGs) in different languages. It is an important way to connect heterogeneous KGs and facilitate knowledge completion. Existing methods have found that incorporating relations into entities can effectively improve KG representation and benefit entity alignment, but these methods learn relation representations that depend on entities and therefore cannot capture the diverse structures of relations. The multiple relations in a KG form diverse structures, such as adjacency structures and ring structures, and this diversity makes relation representation challenging. We therefore propose to construct weighted line graphs that model the diverse structures of relations and to learn relation representations independently of entities. In particular, owing to the diversity of adjacency and ring structures, we construct an adjacency line graph and a ring line graph, respectively, to model the structures of relations and to further improve entity representation. In addition, to alleviate the hubness problem in alignment, we introduce optimal transport into the alignment step and compute the distance matrix in a different way. From a global perspective, we calculate the optimal one-to-one alignment bi-directionally to improve alignment accuracy. Experimental results on two benchmark datasets show that our proposed method significantly outperforms state-of-the-art CLEA methods in both supervised and unsupervised settings.
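The abstract does not spell out the optimal-transport step; one common way to mitigate hubness, consistent with the description, is to turn the entity-to-entity cost matrix into a near-doubly-stochastic plan via Sinkhorn iterations and keep only mutual (bi-directional) best matches. The sketch below is a generic Sinkhorn implementation under those assumptions, not the paper's exact procedure.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, iters=200):
    """Entropy-regularized optimal transport between uniform marginals.
    The coupling's row/column sums are pushed toward uniform, which
    suppresses hub entities that would otherwise attract many neighbors."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    r, c = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

def bidirectional_matches(cost):
    """Keep a pair only if it is the argmax of both its row and its column."""
    plan = sinkhorn_plan(cost)
    rows = plan.argmax(axis=1)
    cols = plan.argmax(axis=0)
    return [(i, j) for i, j in enumerate(rows) if cols[j] == i]

cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])
print(bidirectional_matches(cost))  # [(0, 0), (1, 1), (2, 2)]
```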
{"title":"Diverse Structure-aware Relation Representation in Cross-Lingual Entity Alignment","authors":"Yuhong Zhang, Jianqing Wu, Kui Yu, Xindong Wu","doi":"10.1145/3638778","DOIUrl":"https://doi.org/10.1145/3638778","url":null,"abstract":"<p>Cross-lingual entity alignment (CLEA) aims to find equivalent entity pairs between knowledge graphs (KG) in different languages. It is an important way to connect heterogeneous KGs and facilitate knowledge completion. Existing methods have found that incorporating relations into entities can effectively improve KG representation and benefit entity alignment, and these methods learn relation representation depending on entities, which cannot capture the diverse structures of relations. However, multiple relations in KG form diverse structures, such as adjacency structure and ring structure. This diversity of relation structures makes the relation representation challenging. Therefore, we propose to construct the weighted line graphs to model the diverse structures of relations and learn relation representation independently from entities. Especially, owing to the diversity of adjacency structures and ring structures, we propose to construct adjacency line graph and ring line graph respectively to model the structures of relations and to further improve entity representation. In addition, to alleviate the hubness problem in alignment, we introduce the optimal transport into alignment and compute the distance matrix in a different way. From a global perspective, we calculate the optimal 1-to-1 alignment bi-directionally to improve the alignment accuracy. Experimental results on two benchmark datasets show that our proposed method significantly outperforms state-of-the-art CLEA methods in both supervised and unsupervised manners.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}