A meme is a piece of media created to share an opinion or emotion across the internet. Owing to their popularity, memes have become a new form of communication on social media. However, because of their nature, they are increasingly used in harmful ways such as trolling and cyberbullying. Different data modelling methods open up different possibilities for feature extraction, turning raw data into useful information, and the variety of modalities in the data plays a significant role in predicting the results. We explore the significance of the visual features of images in classifying memes. Memes are a blend of image and text, where the text is embedded into the picture. We consider a meme to be trolling if it in any way tries to troll a particular individual, group, or organisation. We classify memes as trolling or non-trolling based on their images and text, and evaluate whether visual features carry any major significance for identifying whether a meme is trolling. Our work illustrates different textual analysis methods and contrasts multimodal approaches, ranging from simple feature merging to cross-attention, that utilise both visual and textual features. The fine-tuned cross-lingual language model, XLM, performed best in the textual analysis, and the multimodal transformer performed best in the multimodal analysis.
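The "simple merging" end of the multimodal spectrum can be pictured as early fusion: concatenate a visual feature vector with a textual feature vector and feed the merged vector to a classifier. The following is a minimal sketch under that assumption; all function names, dimensions, and weight values are hypothetical, and a real system would learn the weights rather than fix them.

```python
# Minimal sketch of early fusion for troll-meme classification.
# Hypothetical names and toy dimensions; weights would be learned in practice.

def early_fusion(visual_feats, text_feats):
    """Merge the two modalities by simple concatenation."""
    return list(visual_feats) + list(text_feats)

def linear_score(features, weights, bias):
    """Dot product plus bias over the merged feature vector."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def classify(visual_feats, text_feats, weights, bias):
    """Positive score -> 'troll', otherwise 'not-troll'."""
    merged = early_fusion(visual_feats, text_feats)
    return "troll" if linear_score(merged, weights, bias) > 0 else "not-troll"

# Toy example: 3-dim visual features, 2-dim textual features.
visual = [0.2, -0.1, 0.4]
text = [0.7, 0.3]
weights = [1.0, 0.5, -0.2, 0.8, 0.6]  # fixed here for illustration only
bias = -0.5
label = classify(visual, text, weights, bias)
```

Cross-attention approaches differ in that the modalities interact before fusion (each modality attends over the other's features) rather than being merged only at the input to the classifier.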