{"title":"基于三重关系信息交叉注意网络的多模态仇恨模因检测","authors":"Xiaolin Liang, Yajuan Huang, Wen Liu, He Zhu, Zhao Liang, Libo Chen","doi":"10.1109/IJCNN55064.2022.9892164","DOIUrl":null,"url":null,"abstract":"Memes are spreading on social networking. Most are created to be humorous, while some become hateful with the combination of images and words, conveying negative information to people. The hateful memes detection poses an interesting multimodal fusion problem, unlike traditional multi-modal tasks, the majority of memos have images and text that are only weakly consistent or even uncorrelated, so various modalities contained in the data play an important role in predicting its results. In this paper, we attempt to work on the Facebook Meme challenge, which solves the binary classification task of predicting a meme's hatefulness or not. We extract triplet-relation information from origin OCR text features, image content features and image caption features and proposed a novel cross-attention network to address this task. TRICAN leverages object detection and image caption models to explore visual modalities to obtain “actual captions” and then combines combine origin OCR text with the multi-modal representation to perform hateful memes detection. These meme-related features are then reconstructed and fused into one feature vector for prediction. We have performed extensively experimental on multi-modal memory datasets. Experimental results demonstrate the effectiveness of TRICAN and the usefulness of triplet-relation information.","PeriodicalId":106974,"journal":{"name":"2022 International Joint Conference on Neural Networks (IJCNN)","volume":"41 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TRICAN: Multi-Modal Hateful Memes Detection with Triplet-Relation Information Cross-Attention Network\",\"authors\":\"Xiaolin Liang, Yajuan Huang, Wen Liu, He Zhu, Zhao Liang, Libo Chen\",\"doi\":\"10.1109/IJCNN55064.2022.9892164\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Memes are spreading on social networking. Most are created to be humorous, while some become hateful with the combination of images and words, conveying negative information to people. The hateful memes detection poses an interesting multimodal fusion problem, unlike traditional multi-modal tasks, the majority of memos have images and text that are only weakly consistent or even uncorrelated, so various modalities contained in the data play an important role in predicting its results. In this paper, we attempt to work on the Facebook Meme challenge, which solves the binary classification task of predicting a meme's hatefulness or not. We extract triplet-relation information from origin OCR text features, image content features and image caption features and proposed a novel cross-attention network to address this task. TRICAN leverages object detection and image caption models to explore visual modalities to obtain “actual captions” and then combines combine origin OCR text with the multi-modal representation to perform hateful memes detection. These meme-related features are then reconstructed and fused into one feature vector for prediction. We have performed extensively experimental on multi-modal memory datasets. 
Experimental results demonstrate the effectiveness of TRICAN and the usefulness of triplet-relation information.\",\"PeriodicalId\":106974,\"journal\":{\"name\":\"2022 International Joint Conference on Neural Networks (IJCNN)\",\"volume\":\"41 4\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Joint Conference on Neural Networks (IJCNN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IJCNN55064.2022.9892164\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN55064.2022.9892164","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TRICAN: Multi-Modal Hateful Memes Detection with Triplet-Relation Information Cross-Attention Network
Memes are spreading across social networks. Most are created to be humorous, but some become hateful through the combination of images and words, conveying negative messages. Hateful meme detection poses an interesting multi-modal fusion problem: unlike traditional multi-modal tasks, the images and text of most memes are only weakly consistent or even uncorrelated, so every modality contained in the data plays an important role in prediction. In this paper, we work on the Facebook Meme challenge, a binary classification task of predicting whether a meme is hateful or not. We extract triplet-relation information from the original OCR text features, image content features, and image caption features, and propose a novel cross-attention network, TRICAN, to address this task. TRICAN leverages object detection and image captioning models to explore the visual modality and obtain "actual captions", and then combines the original OCR text with the multi-modal representation to perform hateful meme detection. These meme-related features are then reconstructed and fused into one feature vector for prediction. We have performed extensive experiments on multi-modal meme datasets. Experimental results demonstrate the effectiveness of TRICAN and the usefulness of triplet-relation information.
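The abstract describes three feature streams (OCR text, image content, image caption) interacting through cross-attention before being fused into a single vector for binary classification. The following is a minimal sketch of such a triplet-relation cross-attention fusion, not the authors' released code: the feature dimensions, the use of `torch.nn.MultiheadAttention`, the "attend to the other two modalities" pattern, and the pooling/classification head are all assumptions made for illustration.

```python
# Hypothetical sketch of triplet-relation cross-attention fusion for
# hateful meme classification. All dimensions and layer choices are assumed,
# not taken from the paper.
import torch
import torch.nn as nn


class TripletCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Each modality attends to the concatenation of the other two
        # (one of several plausible ways to realize "triplet-relation"
        # cross-attention).
        self.attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_caption = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, text, image, caption):
        # text, image, caption: (batch, seq_len, dim) token/region features
        # produced upstream by OCR text encoding, object detection, and
        # image captioning, respectively.
        ic = torch.cat([image, caption], dim=1)
        tc = torch.cat([text, caption], dim=1)
        ti = torch.cat([text, image], dim=1)
        t, _ = self.attn_text(text, ic, ic)
        i, _ = self.attn_image(image, tc, tc)
        c, _ = self.attn_caption(caption, ti, ti)
        # Mean-pool each attended stream and fuse into one feature vector.
        fused = torch.cat([t.mean(1), i.mean(1), c.mean(1)], dim=-1)
        return self.classifier(fused)  # logits: hateful vs. not hateful
```

Usage would follow the standard pattern: encode the three streams to a common dimension, pass them through the module, and train with cross-entropy on the binary hatefulness label.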