
Latest publications: 2022 20th International Conference on Language Engineering (ESOLEC)

Arabic Documents Layout Analysis (ADLA) using Fine-tuned Faster RCNN
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009375
Latifa Aljiffry, Hassanin M. Al-Barhamtoshy, A. Jamal, Felwa A. Abukhodair
At present, there is massive interest in document digitization, image searching, and natural language processing, using different types of models. The first step in applying any such processing, for example image-to-text conversion, is layout analysis, which is this paper's field of interest. The problem is acute for the Arabic language, where there is a well-noticed research gap in this field. The main limitation of existing research is a common one: small dataset size, which in turn yields less accurate results. In this paper, we use two distinct types of Arabic-language datasets. We propose a tuned model for layout analysis of Arabic printed and early printed documents using Faster RCNN (ADLA). The proposed model is based on tuning the Faster Region-based Convolutional Neural Network (RCNN) model to match our two datasets, with different regions of interest (RoI). For evaluation, we compared the proposed model with two distinct existing models (LABA & FFRA). The F1 score of our proposed model exceeds that of the LABA model, at 99.4% versus the LABA model's 90.5%. Against the FFRA model, our model achieves 99.59% accuracy, whereas the FFRA model reported 99.83% accuracy.
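The paper itself gives no code, but the evaluation it describes (F1 over detected layout regions) conventionally starts from intersection-over-union matching between predicted and ground-truth boxes. A minimal sketch of that criterion; the box format and any matching threshold are assumptions, not the authors' settings:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted text block overlapping a ground-truth block by one quarter on each axis:
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A prediction is then typically counted as a true positive when its IoU with a ground-truth region of the same class exceeds a threshold such as 0.5, from which precision, recall, and F1 follow.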
Citations: 1
Smart Customer Care: Scraping Social Media to Predict Customer Satisfaction in Egypt Using Machine Learning Models
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009194
M. Anwar, Karim Omar, A. Abbas, Fakhreldin Abdelmonim, Mohammad Refaie, Walaa Medhat, Aly Abdelrazek, Yomna Eid, Eman Gawish
This paper proposes the utilization of posts from social media to extract and analyze customer opinions and sentiment towards any specified topic in Egypt. Summarized statistics and sentiment values are then displayed to the consumer (companies such as Vodafone, WE, etc.) through an attractive and functional user interface. The text, location, and time of thousands of posts are scraped, stored, preprocessed, then passed through topic modelling to infer the hidden themes, and delivered to a Recurrent Neural Network (RNN) to output whether the topic was positive or negative. Topic modelling was implemented using the BERT architecture and AraBERT word embeddings. Sentiment analysis model training was conducted on approximately 4000 rows of processed data and made use of Arabic GloVe embeddings to speed up sentiment and word-pattern recognition. Five models were experimented on: LSTM, GRU, CNN, LSTM + CNN and GRU + CNN. Overall, the GRU was the model with the best results, with an accuracy of 86.19%, a loss of 0.3349 and an F1-score of 0.858 when validating on the test data.
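As a sketch of the recurrent unit behind the best-performing model above, here is a single GRU step in NumPy; the dimensions and random initialization are illustrative only, not the paper's configuration:

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: x is the current input vector, h the previous hidden state."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sigmoid(Wz @ x + Uz @ h)               # update gate: how much of h to replace
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate: how much of h feeds the candidate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # interpolate old state and candidate

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                               # toy sizes; a real model would use embeddings
Wz, Wr, Wh = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((d_h, d_h)) for _ in range(3))

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):       # run over a 5-step "sentence"
    h = gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```

The final hidden state `h` would then be fed to a small classifier head to predict the positive/negative label.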
Citations: 1
Automatic Detection of Various Types of Lung Cancer Based on Histopathological Images Using a Lightweight End-to-End CNN Approach
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009108
Ahmed S. Sakr
Lung cancer is one of the main causes of death and illness, with malignant lung tumours the leading cause of both. According to reports, lung cancer incidence is on the rise. Lung cancer histopathology is an important element of patient care, and using artificial intelligence methods for the identification of lung cancer can be a highly valuable approach. In this article, we offer a modified lightweight end-to-end deep learning strategy based on convolutional neural networks (CNN) to accurately identify lung cancer. In this method, the input histopathology images are normalized before being fed into the CNN model, which is then used to detect lung cancer. The effectiveness of our approach is assessed on a publicly accessible database of histopathological images and compared to the most advanced cancer detection methods already in use. Examination of the results indicates that the suggested deep model for lung cancer diagnosis yields an accuracy of 99.5%, better than the other approaches. In addition, our method is computationally efficient.
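The normalization step mentioned in the abstract is not specified in detail; a common per-channel scheme, offered here as an assumption, scales pixel values to [0, 1] and then standardizes each channel to zero mean and unit variance:

```python
import numpy as np

def normalize_image(img):
    """Normalize a uint8 H x W x C histopathology image per channel."""
    x = img.astype(np.float32) / 255.0                 # scale to [0, 1]
    mean = x.mean(axis=(0, 1), keepdims=True)          # per-channel mean
    std = x.std(axis=(0, 1), keepdims=True) + 1e-7     # per-channel std (epsilon avoids /0)
    return (x - mean) / std

# Toy 64x64 RGB patch standing in for a tissue image:
img = np.random.default_rng(1).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
out = normalize_image(img)
```

Standardized inputs of this kind are what the CNN would consume; real pipelines often use dataset-wide statistics instead of per-image ones.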
Citations: 1
A Critical Survey on Arabic Named Entity Recognition and Diacritization Systems
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009095
Muhammad Nabil Rateb, S. Alansary
Language technologies are considered a subdivision of the Artificial Intelligence (AI) field, which sheds light on how toolkits are programmed to simulate the natural language of humans. Over the last decade, there has been unique advancement in the Natural Language Processing (NLP) field, notably regarding the Arabic language. Arabic is the language spoken by almost two billion Muslims worldwide and is one of the six officially acknowledged languages of the UN. This paper is dedicated to a survey of three cutting-edge toolkits utilized to process and analyze the Arabic language: CamelTools, Farasa, and Madamira. The paper presents background on the challenges that have confronted Arabic Natural Language Processing (ANLP), predominantly concerning diacritization and Named Entity Recognition (NER) systems. Next, it illustrates the main components of CamelTools, Farasa, and Madamira. After that, the evaluation processes of the three toolkits are presented and their results compared. Finally, the paper presents observations based on this comparison. The survey reveals that CamelTools performs best, since its design was inspired by the best toolkits available in the field. Farasa outpaces Madamira in all comparisons regarding ANER and Arabic diacritization.
Citations: 1
Comparison of Different Deep Learning Approaches to Arabic Sarcasm Detection
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009500
M. Galal, Ahmed Hassan, Hala H. Zayed, Walaa Medhat
Irony and Sarcasm Detection (ISD) is a crucial task for many NLP applications, especially sentiment and opinion mining. It is also considered a challenging task even for humans. Several studies have focused on employing Deep Learning (DL) approaches, including building Deep Neural Networks (DNN), to detect irony and sarcasm. However, most of them concentrated on detecting sarcasm in English rather than Arabic content, especially studies concerning deep neural networks such as convolutional neural network (CNN) and recurrent neural network (RNN) architectures. This paper investigates several deep learning approaches, including DNNs and fine-tuned pretrained transformer-based language models, for identifying Arabic sarcastic tweets. In addition, it presents a comprehensive evaluation of the impact of data preprocessing techniques and several pretrained word embedding models on the performance of the proposed deep models. Two shared-task datasets on Arabic sarcasm detection are used to develop, fine-tune, and evaluate the different techniques and methods presented in this paper. Results on the first dataset showed that the fine-tuned pretrained transformer-based language model outperformed the developed DNNs; on the second dataset, the proposed DNN models obtained performance comparable to the fine-tuned models. The results also proved the necessity of applying preprocessing techniques with the various deep learning approaches for better detection performance of these models.
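The preprocessing techniques the paper evaluates are not enumerated here; the sketch below shows the kind of Arabic tweet cleaning typically applied before such models (URL/mention removal, diacritic stripping, letter normalization). The exact rules are assumptions, not the paper's pipeline:

```python
import re

# Arabic diacritics: tanwin, short vowels, shadda, sukun, plus the dagger alif.
DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')

def preprocess_tweet(text):
    """Clean an Arabic tweet: drop URLs/mentions/hash signs, strip diacritics, normalize letters."""
    text = re.sub(r'https?://\S+|@\w+', ' ', text)   # URLs and @mentions
    text = text.replace('#', ' ')                    # keep hashtag words, drop the sign
    text = DIACRITICS.sub('', text)
    text = re.sub('[إأآ]', 'ا', text)                # unify alif variants
    text = text.replace('ى', 'ي').replace('ة', 'ه')  # common character normalizations
    return re.sub(r'\s+', ' ', text).strip()
```

Each rule reduces sparsity in the vocabulary that the embeddings and models then see.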
Citations: 0
Sentiment Analysis: Amazon Electronics Reviews Using BERT and Textblob
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009176
Abdulrahman Mahgoub, Hesham Atef, Abdulrahman Nasser, Mohamed Yasser, Walaa Medhat, M. Darweesh, Passent El-Kafrawy
The market needs a deeper and more comprehensive grasp of its insight, which is where the analytics world and methodologies such as sentiment analysis come in. These methods can assist people, especially business owners, in gaining live insights into their businesses and determining whether customers are satisfied or not. This paper aims to provide such indicators by gathering real-world Amazon reviews from Egyptian customers and applying two sentiment analysis methods: Bidirectional Encoder Representations from Transformers (BERT) and TextBlob. The process determines the overall satisfaction of Egyptian customers in the electronics department, in order to focus on a specific domain. The two methods are compared for both the Arabic and English languages. The results show that customers on Amazon.eg are mostly satisfied, at 47%. In terms of performance, BERT outperformed TextBlob, indicating that the word-embedding-based model BERT is superior to the rule-based model TextBlob, with a difference of 15%-25%.
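TextBlob's polarity scoring is rule-based and lexicon-driven; the toy scorer below illustrates that general idea with an invented mini-lexicon and a single negation rule, and is in no way TextBlob's actual implementation:

```python
# Invented toy lexicon mapping words to polarity scores in [-1, 1]:
LEXICON = {'great': 0.8, 'good': 0.7, 'bad': -0.7, 'terrible': -1.0}

def polarity(text):
    """Average signed word scores; 'not' flips the sign of the next word (a simplification)."""
    scores, flip = [], 1.0
    for w in text.lower().split():
        if w == 'not':
            flip = -1.0          # remember to negate the following word
            continue
        if w in LEXICON:
            scores.append(flip * LEXICON[w])
        flip = 1.0               # negation only reaches one word ahead
    return sum(scores) / len(scores) if scores else 0.0
```

A transformer model like BERT instead classifies the whole sequence from learned contextual embeddings, which is the gap the paper's 15%-25% difference reflects.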
Citations: 2
Improving The Performance of Semantic Text Similarity Tasks on Short Text Pairs
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009072
Mohamed Taher Gamal, Passent El-Kafrawy
Training a semantic similarity model to detect duplicate text pairs is a challenging task, as almost all datasets are imbalanced: by the nature of the data, positive samples are fewer than negative samples, which can easily lead to model bias. Using traditional pairwise loss functions such as pairwise binary cross-entropy or contrastive loss on imbalanced data may lead to model bias; triplet loss, however, showed improved performance compared to the other loss functions. In triplet-loss-based models, data is fed to the model as an anchor sentence, a positive sentence, and a negative sentence, and the original data is permuted to follow this input structure. The default training data comprises 363,861 samples (90% of the data), distributed as 134,336 positive and 229,524 negative samples. The triplet structure helped generate a much larger set of 456,219 balanced training samples. The test results showed higher accuracy and F1 scores. We fine-tuned a pretrained RoBERTa model using the triplet loss approach, and testing showed better results: the best model scored an 89.51 F1 score and 91.45 accuracy, compared to an 86.74 F1 score and 87.45 accuracy for the second-best, contrastive-loss-based BERT model.
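The triplet objective described above can be written directly: pull the anchor toward the positive and push it away from the negative until a margin separates the two distances. The margin value and toy vectors here are assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distances between embeddings."""
    d_pos = np.linalg.norm(anchor - positive)   # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative)   # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-d "sentence embeddings" (real ones would come from the fine-tuned RoBERTa encoder):
a = np.array([0.0, 0.0])   # anchor sentence
p = np.array([0.1, 0.0])   # duplicate: near the anchor
n = np.array([3.0, 0.0])   # non-duplicate: far from the anchor
```

When the negative is already more than `margin` farther away than the positive, the loss is zero and the triplet contributes no gradient.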
Citations: 0
The Guidelines of Building a Treebank for Modern Standard Arabic
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009330
Amena Dheif, Ahmed Abd El Ghany, Sameh Al Ansary
Treebanks are among the most needed and used linguistic resources in the fields of Natural Language Processing (NLP) and Natural Language Understanding (NLU). Arabic has only two constituency-based treebanks and a number of dependency treebanks. The current research presents guidelines for building a parsed Arabic treebank for Modern Standard Arabic (MSA). The guidelines cover, first, the choice of grammar formalism; then the genre and size of the treebank; and finally its annotation layers. The study also shows that using traditional Arabic grammar syntactic theory to describe Arabic syntax has proven more suitable than using any of the modern syntax theories. Working with traditional Arabic grammar also helps avoid the errors that the available treebank fell into as a result of using guidelines that do not suit Arabic grammar. The study adopts three layers of annotation: the morphological layer, the syntactic layer, and the grammatical-function layer. The resultant tree is a very detailed and rich syntactic tree, which the researcher prefers over a huge amount of poorly and shallowly annotated data.
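The three annotation layers could be represented along these lines; the class names, field names, and labels below are purely illustrative, not the paper's actual annotation scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str        # surface form of the word
    morph: dict      # morphological layer, e.g. {'pos': 'NOUN', 'state': 'definite'}
    function: str    # grammatical-function layer, e.g. 'mubtada' (topic of a nominal sentence)

@dataclass
class Node:
    label: str                                    # syntactic layer: constituent label
    children: list = field(default_factory=list)  # child Nodes or Tokens

# A two-word nominal sentence, "الكتاب جديد" ("the book is new"):
tree = Node('S', [
    Token('الكتاب', {'pos': 'NOUN', 'state': 'definite'}, 'mubtada'),
    Token('جديد', {'pos': 'ADJ', 'state': 'indefinite'}, 'khabar'),
])
```

Keeping the layers on separate fields lets each be validated or queried independently, which matches the guideline of annotating morphology, syntax, and grammatical function as distinct layers.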
Citations: 0
Identifying Equivalent Words from Different Arabic Dialects Using Deep Learning Techniques
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009555
Hamed Ramadan, Mohammad M. Alqahtani, Abdullah Algoson
The Arabic language comprises many spoken dialects. These dialects differ from written Modern Standard Arabic (MSA) in syntax, lexicon, phonology, and morphology. Arabic dialects vary not only along a geographical continuum but also along sociolinguistic dimensions such as the urban, rural, and Bedouin divide. Currently, Dialectal Arabic (DA) is the essential written language of unofficial communication in the Arab world, appearing on social media platforms, in emails, on Twitter, and elsewhere. Research on computational models of Arabic dialects has attracted strong interest over the last decade. Most of these studies focus on Arabic dialect identification (classification) and on building Arabic dialect corpora. However, finding synonyms of an Arabic dialect word in other Arabic dialects has received limited attention. To bridge this gap, this study develops a model that identifies equivalent words across dialects of the Arab world using deep learning techniques such as word2vec. The research merges and extends existing Arabic dialect corpora and then applies deep learning techniques to achieve the best results for dialectal word synonyms. The outcomes of this research are a new dataset of Arabic dialectal word synonyms and a model with an acceptable accuracy of 81%.
{"title":"Identifying Equivalent Words from Different Arabic Dialects Using Deep Learning Techniques","authors":"Hamed Ramadan, Mohammad M. Alqahtani, Abdullah Algoson","doi":"10.1109/ESOLEC54569.2022.10009555","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009555","url":null,"abstract":"The Arabic language comprises many spoken dialects. These dialects vary from a standard written Modern Standard Arabic (MSA) in terms of syntactic, lexical, phonological, and morphological. Arabic Dialects differ, not only along a geographical continuum, but also with other sociolinguistic factors such as the urban, rural, Bedouin dimension. Currently, Dialectal Arabic (DA) is the essential written language of unofficial communication in the Arab World. These Dialects can be found on social media platforms, emails, Twitter, etc. There has been a high interest in research on computational models of Arabic dialects in the last decade. Most of these studies focus on Arabic dialect identification (classification) and building Arabic dialect corpora. However, finding Arabic dialect word synonyms from another Arabic dialects has received limited attention. To bridge this gap, this study will develop a model to identify the equivalent words from different Arab world dialects using deep learning techniques such as word2vec. This research merged and extended the existing Arabic dialects corpora and then applied some deep learning techniques to achieve the best results for dialectal word synonyms. 
The outcomes of this research are a new dataset of Arabic dialectical word synonyms and a model with acceptable accuracy of 81%.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130825228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
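Once word2vec-style embeddings are trained on a merged dialect corpus, finding a word's equivalent in another dialect reduces to a nearest-neighbour search by cosine similarity. A minimal sketch with toy vectors; the dialect words and vector values here are illustrative, not taken from the paper's dataset or model:

```python
import math

# Toy 3-dimensional embeddings for two dialects. In a real system these
# vectors would come from a word2vec model trained on the merged corpus.
egyptian = {"ezzayak": [0.9, 0.1, 0.3], "delwa2ty": [0.2, 0.8, 0.1]}
gulf = {"shlonak": [0.88, 0.12, 0.28], "alheen": [0.18, 0.82, 0.15]}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def equivalent(word, source, target):
    """Return the target-dialect word whose vector is closest to `word`'s."""
    vec = source[word]
    return max(target, key=lambda w: cosine(vec, target[w]))

print(equivalent("ezzayak", egyptian, gulf))  # shlonak
```

This only works if the two dialects' vectors live in a shared space, which is why the study merges the dialect corpora before training rather than training one model per dialect.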
Sentiment Analysis From Subjectivity to (Im)Politeness Detection: Hate Speech From a Socio-Pragmatic Perspective
Pub Date : 2022-10-12 DOI: 10.1109/ESOLEC54569.2022.10009298
Samar Assem, S. Alansary
Although sentiment analysis is, by definition, the field of Natural Language Processing concerned with analyzing texts in order to evaluate, analyze, and detect the human state of mind across a range of domains, most studies limit it to opinion mining. Yet opinion mining is only one of three sub-fields under the umbrella of sentiment analysis: opinion mining, emotion mining, and ambiguity detection. Notably, ambiguity detection can be considered a combination of the other two sub-fields owing to its linguistic nature, which holds that statistical and/or syntactic-semantic levels of analysis are not adequate to reach a satisfying level of disambiguation of human language. Hence, the current paper proposes digging deeper, to the pragmatic and socio-pragmatic levels of analysis, in order to eliminate ambiguity and avoid misjudgments of texts and social media posts, specifically in the sub-task of detecting hate speech. Accordingly, it suggests an eclectic linguistic model of analysis that includes speech act theory and the theory of (im)politeness.
{"title":"Sentiment Analysis From Subjectivity to (Im)Politeness Detection: Hate Speech From a Socio-Pragmatic Perspective","authors":"Samar Assem, S. Alansary","doi":"10.1109/ESOLEC54569.2022.10009298","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009298","url":null,"abstract":"Although sentiment analysis by definition is that field of Natural Language processing which focuses on analyzing texts that tackle evaluating, analyzing and detecting the state of mind of the human beings towards a range of domains, most of the studies limit it to opinion mining. Yet, opinion mining is just one sub-field of three others under the umbrella of sentiment analysis which are; opinion mining, emotion mining and ambiguity detection. Noticeably, ambiguity detection is considered to be a combination of the other two sub-fields thanks to its linguistic nature that considers statistical and/or syntactic-semantic levels of analysis are not adequate to reach a satisfying level of disambiguating human language. Henceforth, the current paper proposes digging deeply to reach pragmatic and socio-pragmatic levels of analysis in order to eliminate ambiguity and avoid misjudgments over texts and social media posts specifically in the sub-tasks of detecting hate speech. 
Accordingly, it suggests utilizing an eclectic linguistic model of analysis includes speech act theory and the theory of (im)politeness.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131624176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1