Identifying fake information (misinformation) on social media has become a cardinal task because misinformation significantly harms governments and the public, and many spam bots maliciously retweet it. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A Bayesian graph Local extrema Convolution (BLC) is first proposed to aggregate node features in the graph structure. BLC accounts for unreliable relationships and uncertainties in the propagation structure and emphasizes the attribute differences between a node and its neighbors. Then, a new long-tail strategy that matches long-tail users with the global social network is advocated to keep graph neural networks from over-concentrating on high-degree nodes. Finally, the proposed model is evaluated experimentally on two public Twitter datasets; the results demonstrate that the long-tail strategy significantly improves the effectiveness of existing graph-based methods at detecting misinformation. The robustness of BLC is also examined on three graph datasets, where it consistently outperforms traditional algorithms when 15% of a dataset is perturbed.
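The abstract does not specify the BLC layer itself; as a rough, hedged illustration of the "local extrema" idea it builds on (aggregating the *differences* between a node's attributes and those of its neighbors, so that nodes standing out from their neighborhood are emphasized), a minimal sketch follows. The function name `local_extrema_conv`, the adjacency-list input, and the subtraction-based update are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def local_extrema_conv(features, adj, w_self, w_diff):
    """One aggregation step emphasizing node-neighbor attribute
    differences (hypothetical sketch of the 'local extrema' idea).

    features: (n, d) node attribute matrix
    adj: dict mapping node index -> list of neighbor indices
    w_self, w_diff: (d, d) weight matrices
    """
    n, d = features.shape
    out = np.zeros_like(features)
    for i in range(n):
        neighbors = adj.get(i, [])
        # Aggregate differences h_i - h_j rather than raw neighbor features,
        # so "local extrema" nodes receive large activations.
        diff = sum(features[i] - features[j] for j in neighbors) if neighbors else np.zeros(d)
        out[i] = features[i] @ w_self + diff @ w_diff
    return np.maximum(out, 0.0)  # ReLU

# Toy usage: a 3-node path graph with 2-dimensional attributes.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = {0: [1], 1: [0, 2], 2: [1]}
rng = np.random.default_rng(0)
h = local_extrema_conv(feats, adj, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
```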
{"title":"Bayesian Graph Local Extrema Convolution with Long-Tail Strategy for Misinformation Detection","authors":"Guixian Zhang, Shichao Zhang, Guan Yuan","doi":"10.1145/3639408","DOIUrl":"https://doi.org/10.1145/3639408","url":null,"abstract":"<p>It has become a cardinal task to identify fake information (misinformation) on social media because it has significantly harmed the government and the public. There are many spam bots maliciously retweeting misinformation. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A <b>B</b>ayesian graph <b>L</b>ocal extrema <b>C</b>onvolution (BLC) is first proposed to aggregate node features in the graph structure. The BLC approach considers unreliable relationships and uncertainties in the propagation structure, and the differences between nodes and neighboring nodes are emphasized in the attributes. Then, a new long-tail strategy for matching long-tail users with the global social network is advocated to avoid over-concentration on high-degree nodes in graph neural networks. Finally, the proposed model is experimentally evaluated with two publicly Twitter datasets and demonstrates that the proposed long-tail strategy significantly improves the effectiveness of existing graph-based methods in terms of detecting misinformation. The robustness of BLC has also been examined on three graph datasets and demonstrates that it consistently outperforms traditional algorithms when perturbed by 15% of a dataset.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"12 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie
Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs, which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs with limited overlap in weighted hypergraphs has not, to the best of our knowledge, been studied in a principled way. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most k subgraphs with maximum total aggregate density while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of the weight of the intersection to the weight of the union of the node sets of two subgraphs. After showing that this problem is NP-hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as an efficient practical heuristic. Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.
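For concreteness, the pairwise weighted Jaccard coefficient used as the overlap bound can be computed as below. This is a minimal sketch assuming node weights are supplied as a dict; it is illustrative, not code from the paper.

```python
def weighted_jaccard(nodes_a, nodes_b, weight):
    """Weighted Jaccard coefficient of two node sets: the total weight
    of the intersection divided by the total weight of the union."""
    inter = sum(weight[v] for v in nodes_a & nodes_b)
    union = sum(weight[v] for v in nodes_a | nodes_b)
    return inter / union if union > 0 else 0.0

# Example: two subgraphs sharing node 2.
w = {1: 2.0, 2: 1.0, 3: 3.0, 4: 0.5}
print(weighted_jaccard({1, 2}, {2, 3, 4}, w))  # 1.0 / 6.5
```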
{"title":"Finding Subgraphs with Maximum Total Density and Limited Overlap in Weighted Hypergraphs","authors":"Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie","doi":"10.1145/3639410","DOIUrl":"https://doi.org/10.1145/3639410","url":null,"abstract":"<p>Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs in weighted hypergraphs with limited overlap has not been studied in a principled way, to the best of our knowledge. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most <i>k</i> subgraphs with maximum total aggregate density, while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of weights of intersection divided by weights of union on two nodes sets of the subgraphs. After showing that such a problem is NP-Hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as, an efficient practical heuristic. Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"21 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinpeng Li, Hang Yu, Zhenyu Zhang, Xiangfeng Luo, Shaorong Xie
Concept drift is a phenomenon where the distribution of a data stream changes over time. When this happens, model predictions become less accurate, so models built in the past need to be re-learned for the current data. Two design questions need to be addressed when designing a re-learning strategy: which type of concept drift has occurred, and how can the drift type be used to improve re-learning performance? Existing drift detection methods are often good at determining when drift has occurred, but few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the drift type on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, and then uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal drift point via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, at which to re-learn the new model. Experiments validate Type-LDA on both synthetic and real-world data, and the results show that accurately identifying the drift type improves adaptation accuracy.
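The abstract does not give the identifier's training setup; as a hedged sketch of what "pre-training on synthetic data of known drift types" could look like, the snippet below generates labeled one-dimensional streams with sudden versus gradual mean shifts. The two drift types, the stream generator, and all names are illustrative assumptions, not Type-LDA's actual pipeline.

```python
import numpy as np

def synthetic_stream(drift_type, length=500, drift_at=250, rng=None):
    """Generate a 1-D stream whose mean shifts from 0 to 2 at drift_at,
    either abruptly ('sudden') or over a linear ramp ('gradual')."""
    rng = rng or np.random.default_rng()
    x = rng.normal(0.0, 1.0, length)
    if drift_type == "sudden":
        x[drift_at:] += 2.0
    elif drift_type == "gradual":
        x[drift_at:] += np.linspace(0.0, 2.0, length - drift_at)
    return x

# Labeled corpus for pre-training a drift-type identifier.
rng = np.random.default_rng(42)
streams = [(synthetic_stream(t, rng=rng), t)
           for t in ("sudden", "gradual") for _ in range(100)]
```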
{"title":"Concept Drift Adaptation by Exploiting Drift Type","authors":"Jinpeng Li, Hang Yu, zhenyuzhang, Xiangfeng Luo, Shaorong Xie","doi":"10.1145/3638777","DOIUrl":"https://doi.org/10.1145/3638777","url":null,"abstract":"<p>Concept drift is a phenomenon where the distribution of data streams changes over time. When this happens, model predictions become less accurate. Hence, models built in the past need to be re-learned for the current data. Two design questions need to be addressed in designing a strategy to re-learn models: which type of concept drift has occurred, and how to utilize the drift type to improve re-learning performance. Existing drift detection methods are often good at determining when drift has occurred. However, few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the type of drift on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, then it uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal point of drift via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, to re-learn the new model. Experiments validate Type-LDA on both synthetic data and real-world data, and the results show that accurately identifying drift type can improve adaptation accuracy.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139373231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu
Recommendation systems powered by AI are widely used to improve user experience, but they inevitably raise privacy leakage and other security issues because they rely on extensive user data. Addressing these challenges can protect users' personal information, benefit service providers, and foster service ecosystems. Numerous techniques based on differential privacy have been proposed to solve this problem; however, existing solutions suffer from inadequate data utilization and a tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy while protecting users' private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of the Laplace distribution. We first introduce the recommendation method adopted in ID-SR, which is based on matrix factorization with a newly designed social regularization term that improves recommendation effectiveness. We then propose a differential privacy preserving scheme tailored to this method that leverages the characteristics of the Laplace distribution to safeguard user data. Theoretical analysis and experimental evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.
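The infinite divisibility the scheme builds on is a standard property: a Laplace(0, b) variable equals in distribution the sum of n i.i.d. differences of Gamma(1/n, b) variables, which lets Laplace noise be assembled from many small, independently generated pieces. The check below is a generic illustration of that fact, not the ID-SR noise pipeline.

```python
import numpy as np

def divided_laplace(b, n, size, rng):
    """Sample Laplace(0, b) noise as a sum of n i.i.d. gamma differences:
    Laplace(0, b) =d sum_{k=1}^{n} (G_k - G'_k), with G_k, G'_k ~ Gamma(1/n, b)."""
    g1 = rng.gamma(shape=1.0 / n, scale=b, size=(n, size)).sum(axis=0)
    g2 = rng.gamma(shape=1.0 / n, scale=b, size=(n, size)).sum(axis=0)
    return g1 - g2

rng = np.random.default_rng(0)
pieces = divided_laplace(b=1.0, n=10, size=100_000, rng=rng)
direct = rng.laplace(0.0, 1.0, 100_000)
# Both samples have the Laplace(0, 1) distribution (variance ~ 2b^2 = 2).
print(pieces.var(), direct.var())
```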
{"title":"ID-SR: Privacy-Preserving Social Recommendation based on Infinite Divisibility for Trustworthy AI","authors":"Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu","doi":"10.1145/3639412","DOIUrl":"https://doi.org/10.1145/3639412","url":null,"abstract":"<p>Recommendation systems powered by AI are widely used to improve user experience. However, it inevitably raises privacy leakage and other security issues due to the utilization of extensive user data. Addressing these challenges can protect users’ personal information, benefit service providers, and foster service ecosystems. Presently, numerous techniques based on differential privacy have been proposed to solve this problem. However, existing solutions encounter issues such as inadequate data utilization and an tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy and protect users’ private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of Laplace distribution. We first introduce a novel recommendation method adopted in ID-SR, which is established based on matrix factorization with a newly designed social regularization term for improving recommendation effectiveness. Additionally, we propose a differential privacy preserving scheme tailored to the above method that leverages the Laplace distribution’s characteristics to safeguard user data. Theoretical analysis and experimentation evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"26 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng
Data perturbation under a differential privacy constraint is an important approach to protecting data privacy. However, as the data dimensionality increases, the privacy budget allocated to each dimension decreases and the amount of added noise grows, which eventually lowers the utility of the data in training tasks. To protect the privacy of training data while enhancing data utility, we propose a Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA comprises three procedures: quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance, based on information entropy and attribute correlation, provides the arithmetic basis for attribute partition and budget allocation. During attribute partition, all attributes of the training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During budget allocation, a γ-privacy model is proposed to balance data privacy and data utility, providing the privacy constraint that guides budget allocation. Three comprehensive real-world datasets are used to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme achieves a tradeoff between privacy and utility.
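The abstract only names the three procedures; the sketch below illustrates one plausible reading of the first two, scoring attributes by Shannon entropy, splitting them into high/low classes at the median, and then dividing the privacy budget between the classes. The split rule, the 1:2 allocation ratio, and all names are assumptions, not UPPPA's actual design.

```python
import numpy as np
from collections import Counter

def shannon_entropy(column):
    """Empirical Shannon entropy (in bits) of a discrete attribute."""
    counts = np.array(list(Counter(column).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def partition_and_allocate(data, epsilon):
    """Partition attributes into high/low entropy classes at the median
    entropy, then split the total budget between the classes
    (hypothetical 1:2 ratio) and evenly within each class."""
    entropies = {a: shannon_entropy(col) for a, col in data.items()}
    median = np.median(list(entropies.values()))
    high = [a for a, h in entropies.items() if h >= median]
    low = [a for a, h in entropies.items() if h < median]
    budget = {}
    for attrs, share in ((high, epsilon / 3), (low, 2 * epsilon / 3)):
        for a in attrs:
            budget[a] = share / max(len(attrs), 1)
    return budget

data = {"age": [30, 40, 30, 50], "zip": [10001, 10002, 10003, 10004]}
print(partition_and_allocate(data, epsilon=1.0))
```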
{"title":"Utility-aware Privacy Perturbation for Training Data","authors":"Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng","doi":"10.1145/3639411","DOIUrl":"https://doi.org/10.1145/3639411","url":null,"abstract":"<p>Data perturbation under differential privacy constraint is an important approach of protecting data privacy. However, as the data dimensions increase, the privacy budget allocated to each dimension decreases and thus the amount of noise added increases, which eventually leads to lower data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose an Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA includes three procedures, the quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance based on information entropy and attribute correlation provide an arithmetic basis for attribute partition and budget allocation. During the attribute partition, all attributes of training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During the budget allocation, a <i>γ</i>-privacy model is proposed to balance data privacy and data utility so as to provide privacy constraint and guide budget allocation. Three comprehensive sets of real-world data are applied to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme can achieve the tradeoff between privacy and utility.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"2 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topic modelling is a beneficial technique used to discover latent topics in text collections, but semantics are important for correctly understanding the text content and generating a meaningful topic list. By ignoring semantics, that is, by not attempting to grasp the meaning of the words, most existing topic modelling approaches can generate meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA that captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships by matching words from the input text collection with concepts from an ontology, without using pre-calculated values from the ontology that quantify the relationships between words and concepts. Such pre-calculated values may not reflect the actual relationships between words and concepts in the input collection because they are derived from the datasets used to build the ontology rather than from the input collection itself; quantifying the relationships based on the word distribution in the input collection is more realistic and beneficial for the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there is no matching concept in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model that calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.
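As a toy illustration of quantifying concept-word relationships from the input collection's own word distribution (rather than from ontology-supplied scores), the sketch below weights each matched word by its corpus frequency. The matching rule, the normalization, and all names are assumptions, not semantic-LDA's actual method.

```python
from collections import Counter

def concept_word_weights(documents, ontology):
    """ontology: dict mapping concept -> set of associated words.
    Weight each (concept, word) pair by the word's frequency in the
    input collection, so the scores reflect this corpus rather than
    the datasets the ontology was built from."""
    freq = Counter(w for doc in documents for w in doc.split())
    weights = {}
    for concept, words in ontology.items():
        matched = {w: freq[w] for w in words if freq[w] > 0}
        total = sum(matched.values())
        weights[concept] = {w: c / total for w, c in matched.items()} if total else {}
    return weights

docs = ["the bank approved the loan", "the river bank flooded"]
onto = {"Finance": {"bank", "loan"}, "Geography": {"river", "bank"}}
print(concept_word_weights(docs, onto))
```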
{"title":"A Semantics-enhanced Topic Modelling Technique: Semantic-LDA","authors":"Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li","doi":"10.1145/3639409","DOIUrl":"https://doi.org/10.1145/3639409","url":null,"abstract":"<p>Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"11 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139376284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple-instance learning (MIL) solves the problem where training instances are grouped in bags and a binary (positive or negative) label is provided for each bag. Most existing MIL studies need fully labeled bags to train an effective classifier, yet such data can be quite hard to collect in many real-world scenarios due to the high cost of the data labeling process. Fortunately, unlike fully labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Therefore, in this paper we investigate, for the first time, MIL from only triplet comparison bags, where a triplet (Xa, Xb, Xc) carries the weak supervision information that bag Xa is more similar to Xb than to Xc. To solve this problem, we propose to train a bag-level classifier under the empirical risk minimization framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses, such as the square loss and the double hinge loss, are used. Extensive experiments validate that our proposed method significantly outperforms other baselines.
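For reference, the two convex surrogates named above are commonly defined as follows in the convex risk-minimization literature; the 1/4 normalization of the square loss is one common convention, and the MIL-specific risk itself is omitted here.

```python
import numpy as np

def double_hinge_loss(z):
    """Double hinge loss max(-z, max(0, (1 - z)/2)): convex, and linear
    with slope -1 for z <= -1, which keeps risk-rewriting formulations convex."""
    return np.maximum(-z, np.maximum(0.0, 0.5 * (1.0 - z)))

def square_loss(z):
    """Square loss (1 - z)^2 / 4, another convex binary surrogate."""
    return 0.25 * (1.0 - z) ** 2

z = np.linspace(-2, 2, 5)
print(double_hinge_loss(z), square_loss(z))
```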
{"title":"Multiple-Instance Learning from Triplet Comparison Bags","authors":"Senlin Shu, Deng-Bao Wang, Suqin Yuan, Hongxin Wei, Jiuchuan Jiang, Lei Feng, Min-Ling Zhang","doi":"10.1145/3638776","DOIUrl":"https://doi.org/10.1145/3638776","url":null,"abstract":"<p><i>Multiple-instance learning</i> (MIL) solves the problem where training instances are grouped in bags, and a binary (positive or negative) label is provided for each bag. Most of the existing MIL studies need fully labeled bags for training an effective classifier, while it could be quite hard to collect such data in many real-world scenarios, due to the high cost of data labeling process. Fortunately, unlike fully labeled data, <i>triplet comparison data</i> can be collected in a more accurate and human-friendly way. Therefore, in this paper, we for the first time investigate MIL from <i>only triplet comparison bags</i>, where a triplet (<i>X<sub>a</sub></i>, <i>X<sub>b</sub></i>, <i>X<sub>c</sub></i>) contains the weak supervision information that bag <i>X<sub>a</sub></i> is more similar to <i>X<sub>b</sub></i> than to <i>X<sub>c</sub></i>. To solve this problem, we propose to train a bag-level classifier by the <i>empirical risk minimization</i> framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses such as the square loss and the double hinge loss are used. Extensive experiments validate that our proposed method significantly outperforms other baselines.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"30 8","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mehak Khan, Gustavo B. M. Mello, Laurence Habib, Paal Engelstad, Anis Yazidi
In this paper, we present a new propagation paradigm based on the Hyperlink-Induced Topic Search (HITS) algorithm. HITS utilizes the concept of a "self-reinforcing" authority-hub relationship: the centrality of nodes is determined via repeated updates of authority and hub scores that converge to a stationary distribution. Unlike PageRank-based propagation methods, which rely solely on the idea of authorities (in-links), HITS considers the relevance of both authorities (in-links) and hubs (out-links), thereby allowing for a more informative graph learning process. To separate node prediction from propagation, we use a Multilayer Perceptron (MLP) in combination with HITS-based propagation and propose two models: HITS-GNN and HITS-GNN+. We provide additional validation of our models' efficacy through an ablation study that assesses the performance of authority-hub scores in independent models. Moreover, the effects of the main hyper-parameters and of normalization are analyzed to uncover how these techniques influence the performance of our models. Extensive experimental results indicate that the proposed approach improves baseline methods on graph (citation network) benchmark datasets by a decent margin for semi-supervised node classification, which can aid in predicting the categories (labels) of scientific articles based not only on their content but also on the types of articles they cite.
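The authority-hub update underlying the propagation is the classic HITS power iteration; a minimal, self-contained version (independent of the authors' GNN integration, and using L1 rather than the traditional L2 normalization) looks like this.

```python
import numpy as np

def hits(adj, iters=100, tol=1e-8):
    """Classic HITS power iteration on a directed adjacency matrix,
    where adj[i, j] = 1 if there is an edge i -> j.
    Returns (authority, hub) scores normalized to unit L1 norm."""
    n = adj.shape[0]
    auth = np.ones(n) / n
    hub = np.ones(n) / n
    for _ in range(iters):
        new_auth = adj.T @ hub    # good authorities are pointed to by good hubs
        new_hub = adj @ new_auth  # good hubs point to good authorities
        new_auth /= new_auth.sum()
        new_hub /= new_hub.sum()
        converged = (np.abs(new_auth - auth).sum() < tol
                     and np.abs(new_hub - hub).sum() < tol)
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

# Tiny 3-node citation graph: 0 -> 2 and 1 -> 2.
A = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 0]], dtype=float)
print(hits(A))  # node 2 is the authority; nodes 0 and 1 are hubs
```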
{"title":"HITS based Propagation Paradigm for Graph Neural Networks","authors":"Mehak Khan, Gustavo B. M. Mello, Laurence Habib, Paal Engelstad, Anis Yazidi","doi":"10.1145/3638779","DOIUrl":"https://doi.org/10.1145/3638779","url":null,"abstract":"<p>In this paper, we present a new propagation paradigm based on the principle of Hyperlink-Induced Topic Search (HITS) algorithm. The HITS algorithm utilizes the concept of a ”self-reinforcing” relationship of authority-hub. Using HITS, the centrality of nodes is determined via repeated updates of authority-hub scores that converge to a stationary distribution. Unlike PageRank-based propagation methods, which rely solely on the idea of authorities (in-links), HITS considers the relevance of both authorities (in-links) and hubs (out-links), thereby allowing for a more informative graph learning process. To segregate node prediction and propagation, we use a Multilayer Perceptron (MLP) in combination with a HITS-based propagation approach and propose two models; HITS-GNN and HITS-GNN+. We provided additional validation of our models’ efficacy by performing an ablation study to assess the performance of authority-hub in independent models. Moreover, the effect of the main hyper-parameters and normalization is also analyzed to uncover how these techniques influence the performance of our models. Extensive experimental results indicate that the proposed approach significantly improves baseline methods on the graph (citation network) benchmark datasets by a decent margin for semi-supervised node classification, which can aid in predicting the categories (labels) of scientific articles not exclusively based on their content but also based on the type of articles they cite.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"33 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network modeling has been explored extensively by means of theoretical analysis as well as numerical simulations for Network Reconstruction (NR). The network reconstruction problem requires estimating the power-law exponent (γ) of a given input network, so the effectiveness of an NR solution depends on the accuracy with which γ is calculated. In this article, we re-examine the degree-distribution-based estimation of γ, which is not very accurate due to approximations. We propose the X-distribution, which is more accurate than the degree distribution. Various state-of-the-art network models, including CPM, NRM, RefOrCite2, BA, CDPAM, and DMS, are considered for simulation purposes, and the simulation results support the proposed claim. Further, we apply the X-distribution to several real-world networks to calculate their power-law exponents, which differ from those calculated using the respective degree distributions. We observe that X-distributions exhibit more linearity (a straighter line) on the log-log scale than degree distributions, making the X-distribution more suitable for evaluating the power-law exponent via linear fitting on the log-log scale. The MATLAB implementation of the power-law exponent (γ) calculation using the X-distribution for different network models, together with the real-world datasets used in our experiments, is available here: https://github.com/Aikta-Arya/X-distribution-Retraceable-Power-Law-Exponent-of-Complex-Networks.git
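The linear-fitting step the article relies on is standard: plot the empirical distribution on a log-log scale and take the negative slope as γ. The sketch below applies it to a plain degree distribution; the X-distribution itself is defined in the article and is not reproduced here.

```python
import numpy as np
from collections import Counter

def power_law_exponent(degrees):
    """Estimate gamma by linear least squares on the log-log degree
    distribution: log P(k) ~ -gamma * log k + c."""
    counts = Counter(d for d in degrees if d > 0)
    k = np.array(sorted(counts))
    pk = np.array([counts[v] for v in k], dtype=float)
    pk /= pk.sum()
    slope, _ = np.polyfit(np.log(k), np.log(pk), 1)
    return -slope

# Toy sample from a discrete approximation of P(k) ~ k^(-2.5).
rng = np.random.default_rng(0)
sample = np.round(rng.pareto(1.5, 20_000) + 1).astype(int)
print(power_law_exponent(sample))  # roughly 2.5
```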
{"title":"X-distribution: Retraceable Power-Law Exponent of Complex Networks","authors":"Pradumn Kumar Pandey, Aikta Arya, Akrati Saxena","doi":"10.1145/3639413","DOIUrl":"https://doi.org/10.1145/3639413","url":null,"abstract":"<p>Network modeling has been explored extensively by means of theoretical analysis as well as numerical simulations for Network Reconstruction (NR). The network reconstruction problem requires the estimation of the power-law exponent (<i>γ</i>) of a given input network. Thus, the effectiveness of the NR solution depends on the accuracy of the calculation of <i>γ</i>. In this article, we re-examine the degree distribution-based estimation of <i>γ</i>, which is not very accurate due to approximations. We propose <b>X</b>-distribution, which is more accurate as compared to degree distribution. Various state-of-the-art network models, including CPM, NRM, RefOrCite2, BA, CDPAM, and DMS, are considered for simulation purposes, and simulated results support the proposed claim. Further, we apply <b>X</b>-distribution over several real-world networks to calculate their power-law exponents, which differ from those calculated using respective degree distributions. It is observed that <b>X</b>-distributions exhibit more linearity (straight line) on the log-log scale as compared to degree distributions. Thus, <b>X</b>-distribution is more suitable for the evaluation of power-law exponent using linear fitting (on the log-log scale). The MATLAB implementation of power-law exponent (<i>γ</i>) calculation using <b>X</b>-distribution for different network models, and the real-world datasets used in our experiments are available here: https://github.com/Aikta-Arya/X-distribution-Retraceable-Power-Law-Exponent-of-Complex-Networks.git\u0000</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"6 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-lingual entity alignment (CLEA) aims to find equivalent entity pairs between knowledge graphs (KGs) in different languages. It is an important way to connect heterogeneous KGs and facilitate knowledge completion. Existing methods have found that incorporating relations into entities can effectively improve KG representation and benefit entity alignment, but these methods learn relation representations that depend on entities and therefore cannot capture the diverse structures of relations. The multiple relations in a KG form diverse structures, such as adjacency structures and ring structures, and this diversity makes relation representation challenging. We therefore propose to construct weighted line graphs that model the diverse structures of relations and to learn relation representations independently of entities. In particular, owing to the diversity of adjacency and ring structures, we construct an adjacency line graph and a ring line graph, respectively, to model the structures of relations and to further improve entity representation. In addition, to alleviate the hubness problem in alignment, we introduce optimal transport into the alignment step and compute the distance matrix in a different way. From a global perspective, we calculate the optimal one-to-one alignment bi-directionally to improve alignment accuracy. Experimental results on two benchmark datasets show that our proposed method significantly outperforms state-of-the-art CLEA methods in both supervised and unsupervised settings.
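The abstract does not spell out the optimal-transport step; one common way to mitigate hubness, consistent with the description, is to turn the entity-to-entity cost matrix into a near-doubly-stochastic plan via Sinkhorn iterations and keep only mutual (bi-directional) best matches. The sketch below is a generic Sinkhorn implementation under those assumptions, not the paper's exact procedure.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, iters=200):
    """Entropy-regularized optimal transport between uniform marginals.
    The coupling's row/column sums are pushed toward uniform, which
    suppresses hub entities that would otherwise attract many neighbors."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    r, c = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

def bidirectional_matches(cost):
    """Keep a pair only if it is the argmax of both its row and its column."""
    plan = sinkhorn_plan(cost)
    rows = plan.argmax(axis=1)
    cols = plan.argmax(axis=0)
    return [(i, j) for i, j in enumerate(rows) if cols[j] == i]

cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])
print(bidirectional_matches(cost))  # [(0, 0), (1, 1), (2, 2)]
```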
{"title":"Diverse Structure-aware Relation Representation in Cross-Lingual Entity Alignment","authors":"Yuhong Zhang, Jianqing Wu, Kui Yu, Xindong Wu","doi":"10.1145/3638778","DOIUrl":"https://doi.org/10.1145/3638778","url":null,"abstract":"<p>Cross-lingual entity alignment (CLEA) aims to find equivalent entity pairs between knowledge graphs (KG) in different languages. It is an important way to connect heterogeneous KGs and facilitate knowledge completion. Existing methods have found that incorporating relations into entities can effectively improve KG representation and benefit entity alignment, and these methods learn relation representation depending on entities, which cannot capture the diverse structures of relations. However, multiple relations in KG form diverse structures, such as adjacency structure and ring structure. This diversity of relation structures makes the relation representation challenging. Therefore, we propose to construct the weighted line graphs to model the diverse structures of relations and learn relation representation independently from entities. Especially, owing to the diversity of adjacency structures and ring structures, we propose to construct adjacency line graph and ring line graph respectively to model the structures of relations and to further improve entity representation. In addition, to alleviate the hubness problem in alignment, we introduce the optimal transport into alignment and compute the distance matrix in a different way. From a global perspective, we calculate the optimal 1-to-1 alignment bi-directionally to improve the alignment accuracy. Experimental results on two benchmark datasets show that our proposed method significantly outperforms state-of-the-art CLEA methods in both supervised and unsupervised manners.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}