TransVAE-PAM: A Combined Transformer and DAG-based Approach for Enhanced Fake News Detection in Indian Context
Authors: Shivani Tufchi, Tanveer Ahmed, Ashima Yadav, Krishna Kant Agrawal, Ankit Vidyarthi
DOI: 10.1145/3651160 (https://doi.org/10.1145/3651160)
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing (Impact Factor 1.8; JCR Q3, Computer Science, Artificial Intelligence)
Publication date: 2024-03-02
Publication type: Journal Article
Citation count: 0
Abstract
In this study, we introduce “TransVAE-PAM”, a novel method for classifying fake news articles, tailored specifically to the Indian context. The approach uses state-of-the-art contextual and sentence-transformer embedding models to generate article embeddings. To keep the model compact, we employ a Variational Autoencoder (VAE) and a β-VAE to reduce the dimensionality of the embeddings, yielding compact latent representations. To capture the important topics in the news articles, we use the Pachinko Allocation Model (PAM), a Directed Acyclic Graph (DAG) based topic model. These two representations (the reduced-dimension embeddings from the VAE and the topics extracted by PAM) are fused into a single feature set, which is then passed to five different methods for fake news classification. We test embedding generation with eight distinct transformer-based architectures. To validate the feasibility of the proposed approach, we conducted extensive experiments on a proprietary dataset sourced from the “Times of India” and other online media. Given the size of the dataset, the large-scale experiments were run on an NVIDIA supercomputer. Using the DistilBERT transformer architecture, we achieved an accuracy of 96.2% and an F1 score of 96%. Complementing the method with topic modeling improved both the accuracy and the F1 score to 97%. These results indicate that integrating advanced topic models into existing classification schemes is a promising direction for research on fake news detection.
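The two ideas at the core of the abstract, compressing embeddings with a β-VAE and fusing the latent codes with topic features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a diagonal-Gaussian encoder, a squared-error reconstruction term, and fusion by simple concatenation, none of which the abstract specifies. All function names, array sizes, and the value of `beta` are hypothetical, and random arrays stand in for the transformer embeddings, the encoder outputs, and the PAM topic proportions.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Batch mean of squared reconstruction error + beta * KL.

    beta = 1 gives the standard VAE objective; beta > 1 weights the KL term
    more heavily. The squared-error term is an assumption of this sketch.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + beta * kl_to_standard_normal(mu, log_var)))

def fuse_features(latent, topic_dist):
    """Concatenate latent codes and per-document topic proportions row-wise
    (assumed fusion strategy)."""
    return np.concatenate([latent, topic_dist], axis=1)

rng = np.random.default_rng(0)
n_docs, embed_dim, latent_dim, n_topics = 4, 768, 32, 10  # hypothetical sizes

x = rng.normal(size=(n_docs, embed_dim))        # stand-in for article embeddings
mu = rng.normal(size=(n_docs, latent_dim))      # stand-in for encoder means
log_var = np.zeros((n_docs, latent_dim))        # stand-in for encoder log-variances
topics = rng.dirichlet(np.ones(n_topics), size=n_docs)  # stand-in for PAM output

features = fuse_features(mu, topics)
print(features.shape)  # (4, 42)
```

The fused `features` matrix is what a downstream classifier would consume in this sketch. In practice the latent codes would come from a trained encoder and the topic proportions from a fitted PAM model.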
About the journal:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.