This paper describes work on automated abusive language detection in Khasi, a low-resource language spoken primarily in the state of Meghalaya, India. A dataset named the Khasi Abusive Language Dataset (KALD) was created, consisting of 4,573 human-annotated Khasi YouTube and Facebook comments. A corpus of Khasi text was built and used to create Khasi word2vec and fastText word embeddings. Deep learning, traditional machine learning, and ensemble models were used in the study. Experiments were performed using word2vec and fastText embeddings as well as topic vectors obtained via latent Dirichlet allocation (LDA). Further experiments examined whether the zero-shot cross-lingual capability of language models such as LaBSE and LASER can be exploited for abusive language detection in Khasi. The best F1 score of 0.90725 was obtained by an XGBoost classifier. After feature selection and rebalancing of the dataset, F1 scores of 0.91828 and 0.91945 were obtained by SVM-based classifiers.
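As a rough, hypothetical sketch of such a pipeline (not the authors' exact setup), the snippet below trains word2vec embeddings on a toy tokenized corpus, averages them into fixed-length comment vectors, and fits an XGBoost classifier; the data, labels, and hyperparameters are all placeholders.

```python
# Hedged sketch: averaged word2vec features + XGBoost on toy data.
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Toy stand-in for tokenized KALD comments and binary abuse labels.
comments = [["example", "khasi", "comment"], ["another", "comment"]] * 50
labels = [0, 1] * 50

# Train word2vec on the (here, toy) Khasi corpus.
w2v = Word2Vec(sentences=comments, vector_size=100, window=5, min_count=1, epochs=10)

def embed(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([embed(c) for c in comments])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```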
Machine translation has long been a prominent field of research, contributing significantly to the enhancement of human life. Sign language machine translation, a subfield, focuses on translating spoken-language content into sign language and vice versa, thereby facilitating communication between the hearing and hard-of-hearing communities and promoting inclusivity.
This study presents the development of a sign language machine translation system that converts simple Marathi sentences into Indian Sign Language (ISL) glosses and animation. Given the low-resource nature of both languages, a phrase-level rule-based approach was employed for the translation. The translation rules were initially encoded using basic linguistic knowledge of Marathi and ISL, and rules addressing 'simultaneous morphological' features of ISL were subsequently incorporated. These rules were applied during the generation phase of translation to dynamically adjust phonological sign parameters, resulting in improved target-sentence fluency.
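To make the phrase-level rule-based idea concrete, the following toy illustration glosses a Marathi sentence by dropping sign-less function words and mapping content words to ISL glosses; the lexicon, drop list, and rules here are invented for illustration and are not the paper's actual rule set.

```python
# Toy rule-based Marathi-to-ISL gloss generator (all rules/entries hypothetical).
LEXICON = {"मी": "I", "शाळेत": "SCHOOL", "जातो": "GO"}  # assumed Marathi->gloss lexicon
DROP = {"आहे"}  # function words assumed to have no corresponding sign

def marathi_to_isl_gloss(tokens):
    # Rule 1: drop function words that have no sign.
    tokens = [t for t in tokens if t not in DROP]
    # Rule 2: keep SOV order (shared by Marathi and ISL) and gloss each token.
    return [LEXICON.get(t, t.upper()) for t in tokens]

print(marathi_to_isl_gloss(["मी", "शाळेत", "जातो"]))  # ['I', 'SCHOOL', 'GO']
```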
The paper provides a detailed description of the system architecture and translation rules, along with comprehensive experimentation. The system was rigorously evaluated across a range of linguistic features, and the findings are discussed herein.
The web-based version of the system serves as an interpreter for brief communications and can support the teaching and learning of sign language and its grammar in schools for hard-of-hearing students.
Shallow parsing is an important step in many natural language processing tasks. Although shallow parsing has a rich history for resource-rich languages, the same is not true of most Indian languages. Shallow parsing consists of POS tagging and chunking; in our study, we also include morphological analysis as part of shallow parsing. Our work focuses on developing shallow parsers for Indian languages.
For this study, we first consolidated the available shallow parsing corpora for seven Indian languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, and Telugu) for which treebanks are publicly available. We then trained models that achieve state-of-the-art performance for shallow parsing in these languages across multiple domains. Since analyzing model predictions at the sentence level is more realistic, we report the performance of these shallow parsers not only at the token level but also at the sentence level. We also present machine learning techniques for multi-task shallow parsing. Our experiments show that fine-tuning contextual embeddings with multi-task learning improves performance on shallow parsing tasks, both jointly and individually, across different domains. We demonstrate the transfer-learning capability of these models by creating shallow parsers (POS tagging and chunking only) for Gujarati, Odia, and Punjabi, for which no treebanks are available.
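A minimal sketch of a multi-task token-tagging model of this kind is shown below: a shared pretrained encoder with one classification head per task (POS and chunk). The encoder name, label counts, and example sentence are placeholders, not the paper's exact configuration.

```python
# Hedged sketch: shared encoder + per-task heads for multi-task shallow parsing.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskShallowParser(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 n_pos=40, n_chunk=25):  # label counts assumed
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.pos_head = nn.Linear(hidden, n_pos)      # POS tagging head
        self.chunk_head = nn.Linear(hidden, n_chunk)  # chunking head

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return self.pos_head(h), self.chunk_head(h)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = MultiTaskShallowParser()
batch = tok(["राम घर जाता है"], return_tensors="pt")
pos_logits, chunk_logits = model(batch["input_ids"], batch["attention_mask"])
# Training would sum per-token cross-entropy losses from both heads over gold tags.
```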
As part of this work, we will release the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages, covering both major language families, Indo-Aryan and Dravidian. These benchmarks serve as common building blocks for evaluating and understanding the various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.
Dependency syntactic structure is widely used in event extraction. However, the dependency structure, which reflects syntactic features, is essentially different from the event structure, which reflects semantic features, and this mismatch leads to performance degradation. In this paper, we propose to use Event Trigger Structures for Event Extraction (ETSEE), which can compensate for the inconsistency between the two structures. First, we use the ACE2005 dataset as a case study and annotate three kinds of ETSs: “light verb + trigger”, “preposition structures”, and “tense + trigger”. Then we design a graph-based event extraction model that jointly identifies triggers and arguments, where the graph consists of both the dependency structure and the ETSs. Experiments show that our model significantly outperforms state-of-the-art methods. Through empirical analysis and manual observation, we find that the ETSs bring the following benefits: (1) enriching trigger identification features by introducing structural event information; (2) enriching dependency structures with event semantic information; and (3) enhancing the interactions between triggers and candidate arguments by shortening their distances in the dependency graph.
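An illustrative sketch of the core idea is given below: a graph convolution over a graph whose edges are the union of dependency arcs and additional trigger-structure arcs. The GCN layer, edge lists, and feature dimensions are our own simplified assumptions, not the paper's model.

```python
# Hedged sketch: GCN over dependency arcs augmented with (assumed) ETS arcs.
import torch
import torch.nn as nn

def build_adjacency(n, dep_edges, ets_edges):
    """Symmetric adjacency with self-loops over both edge sets."""
    A = torch.eye(n)
    for i, j in dep_edges + ets_edges:
        A[i, j] = A[j, i] = 1.0
    return A

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, H, A):
        # Row-normalized neighborhood aggregation followed by a linear map.
        A_hat = A / A.sum(dim=-1, keepdim=True)
        return torch.relu(self.lin(A_hat @ H))

# Toy sentence of 5 tokens with 128-dim contextual features.
H = torch.randn(5, 128)
dep_edges = [(0, 1), (1, 2), (2, 3)]  # dependency arcs
ets_edges = [(1, 4)]                  # e.g., a "light verb + trigger" arc (assumed)
A = build_adjacency(5, dep_edges, ets_edges)
H_out = GCNLayer(128)(H, A)           # node features for trigger/argument scoring
```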
Ways of communication among people have changed due to advancements in information technology and the rise of online social media. Many people express their feelings, ideas, and emotions on social media sites such as Instagram, Twitter, Gab, Reddit, Facebook, and YouTube. However, some people misuse social media to send hateful messages to specific individuals or groups, creating chaos. For governance authorities, manually identifying hate speech across social media platforms in order to prevent such chaos is a difficult task. In this study, a hybrid deep-learning model that combines a bidirectional long short-term memory (BiLSTM) network and a convolutional neural network (CNN) is proposed to classify hate speech in textual data. The model incorporates GloVe-based word embeddings, dropout, L2 regularization, and global max pooling. The proposed BiLSTM-CNN model was evaluated on several datasets and achieves state-of-the-art performance, superior to traditional and existing machine learning methods in terms of accuracy, precision, recall, and F1-score.
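A minimal Keras sketch of the described BiLSTM-CNN stack follows; the vocabulary size, sequence length, and other hyperparameters are assumed, and the randomly initialized embedding layer stands in for pretrained GloVe vectors.

```python
# Hedged sketch of a BiLSTM-CNN hate speech classifier (hyperparameters assumed).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

VOCAB, SEQ_LEN, EMB_DIM = 20000, 100, 100  # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, EMB_DIM),  # pretrained GloVe weights would be loaded here
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Conv1D(128, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 regularization
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # hate vs. non-hate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
model.summary()
```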
Urdu, characterized by its intricate morphological structure and linguistic nuances, presents distinct challenges for computational sentiment analysis. Addressing these, we introduce "UrduAspectNet", a dedicated model tailored for Aspect-Based Sentiment Analysis (ABSA) in Urdu. Central to our approach is a rigorous preprocessing phase: leveraging the Stanza library, we extract part-of-speech (POS) tags and lemmas, ensuring that Urdu's linguistic intricacies are aptly represented. To probe the effectiveness of different embeddings, we trained our model using both mBERT and XLM-R embeddings, comparing their performance to identify the most effective representation for Urdu ABSA. Recognizing the nuanced inter-relationships between words, especially in Urdu's flexible syntactic constructs, our model incorporates a dual Graph Convolutional Network (GCN) layer. To address the absence of a dedicated Urdu ABSA dataset, we curated our own, collecting 4,603 news headlines from domains such as politics, entertainment, business, and sports. These headlines, sourced from diverse news platforms, are annotated not only with their prevalent aspects but also with the corresponding sentiment polarities, categorized as positive, negative, or neutral. Despite the inherent complexities of Urdu, such as its colloquial expressions and idioms, "UrduAspectNet" showcases remarkable efficacy. Initial comparisons between mBERT and XLM-R embeddings integrated with the dual GCN provide valuable insights into their respective strengths in the context of Urdu ABSA. With broad applications spanning media analytics, business insights, and socio-cultural analysis, "UrduAspectNet" is positioned as a pivotal benchmark in Urdu ABSA research.
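The preprocessing step described above can be sketched as follows with Stanza's Urdu pipeline; the example headline is ours, and the model files must be fetched once with `stanza.download("ur")`.

```python
# Hedged sketch: POS tags and lemmas for Urdu via Stanza (example sentence assumed).
import stanza

# Requires a one-time: stanza.download("ur")
nlp = stanza.Pipeline(lang="ur", processors="tokenize,pos,lemma")
doc = nlp("یہ فلم بہت اچھی ہے")  # "This film is very good"

for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.lemma)
```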
The relevance of research on automatic speech recognition (ASR) for low-resource languages stems from limited training data and the need for new technologies to improve efficiency and performance. The purpose of this work was to study the main aspects of integrated end-to-end speech recognition and the use of modern technologies in the natural language processing of agglutinative languages, including Kazakh. Language models were studied using a combination of comparative, graphical, statistical, and analytic-synthetic methods. The paper addresses ASR in agglutinative languages, particularly Kazakh, through a unified neural network model that integrates both acoustic and language modeling. Employing techniques such as connectionist temporal classification (CTC) and attention mechanisms, the study focuses on effective speech-to-text transcription for languages with complex morphologies. Transfer learning from high-resource languages helps mitigate data scarcity in languages such as Kazakh, Kyrgyz, Uzbek, Turkish, and Azerbaijani. The research assesses model performance, underscores ASR challenges, and proposes advancements for these languages. It also includes a comparative analysis of phonetic and word-formation features in agglutinative Turkic languages, supported by statistical data. The findings support further research in linguistics and technology for enhancing speech recognition and synthesis, contributing to voice identification and automation processes.
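As a minimal illustration of the CTC objective used in such end-to-end ASR systems, the sketch below computes the CTC loss on randomly generated encoder outputs; the tensor shapes, vocabulary size, and toy transcripts are placeholders, not the paper's actual model.

```python
# Hedged sketch of the CTC training objective for end-to-end ASR (toy data).
import torch
import torch.nn as nn

T, N, C = 50, 2, 30  # time steps, batch size, characters + blank (assumed sizes)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # encoder output
targets = torch.randint(1, C, (N, 10))                 # toy character transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)  # frames per utterance
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients teach the encoder to align speech frames with text
print(loss.item())
```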