{"title":"LATEX-GCL:基于大语言模型(LLMs)的文本归因图对比学习数据扩展","authors":"Haoran Yang, Xiangyu Zhao, Sirui Huang, Qing Li, Guandong Xu","doi":"arxiv-2409.01145","DOIUrl":null,"url":null,"abstract":"Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised\ngraph learning that has attracted attention across various application\nscenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet\nto be explored. Because conventional augmentation techniques like feature\nembedding masking cannot directly process textual attributes on TAGs. A naive\nstrategy for applying GCL to TAGs is to encode the textual attributes into\nfeature embeddings via a language model and then feed the embeddings into the\nfollowing GCL module for processing. Such a strategy faces three key\nchallenges: I) failure to avoid information loss, II) semantic loss during the\ntext encoding phase, and III) implicit augmentation constraints that lead to\nuncontrollable and incomprehensible results. In this paper, we propose a novel\nGCL framework named LATEX-GCL to utilize Large Language Models (LLMs) to\nproduce textual augmentations and LLMs' powerful natural language processing\n(NLP) abilities to address the three limitations aforementioned to pave the way\nfor applying GCL to TAG tasks. Extensive experiments on four high-quality TAG\ndatasets illustrate the superiority of the proposed LATEX-GCL method. The\nsource codes and datasets are released to ease the reproducibility, which can\nbe accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LATEX-GCL: Large Language Models (LLMs)-Based Data Augmentation for Text-Attributed Graph Contrastive Learning\",\"authors\":\"Haoran Yang, Xiangyu Zhao, Sirui Huang, Qing Li, Guandong Xu\",\"doi\":\"arxiv-2409.01145\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised\\ngraph learning that has attracted attention across various application\\nscenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet\\nto be explored. Because conventional augmentation techniques like feature\\nembedding masking cannot directly process textual attributes on TAGs. A naive\\nstrategy for applying GCL to TAGs is to encode the textual attributes into\\nfeature embeddings via a language model and then feed the embeddings into the\\nfollowing GCL module for processing. Such a strategy faces three key\\nchallenges: I) failure to avoid information loss, II) semantic loss during the\\ntext encoding phase, and III) implicit augmentation constraints that lead to\\nuncontrollable and incomprehensible results. In this paper, we propose a novel\\nGCL framework named LATEX-GCL to utilize Large Language Models (LLMs) to\\nproduce textual augmentations and LLMs' powerful natural language processing\\n(NLP) abilities to address the three limitations aforementioned to pave the way\\nfor applying GCL to TAG tasks. Extensive experiments on four high-quality TAG\\ndatasets illustrate the superiority of the proposed LATEX-GCL method. 
The\\nsource codes and datasets are released to ease the reproducibility, which can\\nbe accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.\",\"PeriodicalId\":501032,\"journal\":{\"name\":\"arXiv - CS - Social and Information Networks\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Social and Information Networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.01145\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Social and Information Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01145","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised graph learning that has attracted attention across various application scenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet to be explored, because conventional augmentation techniques, such as feature-embedding masking, cannot directly process the textual attributes of TAGs. A naive strategy for applying GCL to TAGs is to encode the textual attributes into feature embeddings via a language model and then feed the embeddings into a subsequent GCL module for processing. Such a strategy faces three key challenges: I) failure to avoid information loss, II) semantic loss during the text-encoding phase, and III) implicit augmentation constraints that lead to uncontrollable and incomprehensible results. In this paper, we propose a novel GCL framework named LATEX-GCL that utilizes Large Language Models (LLMs) to produce textual augmentations, exploiting LLMs' powerful natural language processing (NLP) abilities to address the three aforementioned limitations and pave the way for applying GCL to TAG tasks. Extensive experiments on four high-quality TAG datasets illustrate the superiority of the proposed LATEX-GCL method. The source code and datasets are released to ease reproducibility and can be accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.
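
To make the idea of LLM-produced textual augmentations for GCL concrete, below is a minimal, hypothetical Python sketch. It is not the authors' implementation: the llm_augment stub stands in for a real LLM paraphrasing call, encode_text is a toy hashing encoder used in place of a language-model encoder so the sketch runs standalone, and the InfoNCE objective is a standard contrastive loss assumed here for illustration.

import torch
import torch.nn.functional as F

def llm_augment(text: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g., a paraphrase prompt).

    In practice this would query a real LLM to rewrite a node's textual
    attribute, yielding an explicit, human-readable augmentation instead of
    an opaque perturbation in embedding space.
    """
    return "Paraphrased: " + text  # placeholder augmentation

def encode_text(text: str, dim: int = 64) -> torch.Tensor:
    """Toy deterministic text encoder (hash-based bag of tokens).

    A real pipeline would use a language-model encoder; this keeps the
    sketch self-contained and runnable.
    """
    vec = torch.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return F.normalize(vec, dim=0)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Standard InfoNCE contrastive loss between two views of the same nodes."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau           # pairwise cosine similarities / temperature
    labels = torch.arange(z1.size(0))  # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

# Two views per node: the original text attribute vs. its LLM augmentation.
texts = ["graph contrastive learning paper", "text attributed graph node"]
view1 = torch.stack([encode_text(t) for t in texts])
view2 = torch.stack([encode_text(llm_augment(t)) for t in texts])
print(info_nce(view1, view2))

In this sketch the augmentation happens in text space before encoding, which is the abstract's point: the transformation is explicit and inspectable, unlike masking applied to feature embeddings after encoding.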