Amina: an Arabic multi-purpose integral news articles dataset

Neural Computing and Applications. Pub Date: 2024-09-18. DOI: 10.1007/s00521-024-10277-0

Mohamed Zaytoon, Muhannad Bashar, Mohamed A. Khamis, Walid Gomaa

Abstract

Electronic newspapers are one of the most common sources of Modern Standard Arabic. Existing datasets of Arabic news articles typically provide a title, a body, and a single label. Ignoring important features, such as the article author, image, tags, and publication date, can degrade the efficacy of classification models. In this paper, we propose the Arabic multi-purpose integral news articles (AMINA) dataset. AMINA is a large-scale Arabic news corpus with over 1,850,000 articles collected from 9 Arabic newspapers in different countries. It includes all the article features: title, tags, publication date and time, location, author, article image and its caption, and the number of visits. To test the efficacy of the proposed dataset, three tasks were developed and validated: classification of article textual content, generation of article textual content, and article image classification. For content classification, we evaluated the performance of several state-of-the-art Arabic NLP models, including AraBERT and CAMeL-BERT. For content generation, the Reformer architecture is adopted as a character-level text generation model. For image classification, applied to the Al-Sharq and Youm7 news portals, we compared the performance of 10 pre-trained models, including ConvNeXt, MaxViT, and ResNet18. Overall, the study verifies the significance and contribution of the newly introduced Arabic articles dataset. The AMINA dataset has been released at https://huggingface.co/datasets/MohamedZayton/AMINA.
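The article features listed in the abstract can be pictured as a single record type. The following is a minimal sketch only: the class name NewsArticle, every field name, and the helper to_classifier_text are hypothetical and are not taken from the released dataset, whose actual column names and splits should be checked on its Hugging Face page. The sketch also assumes that each record carries a body and a label, as the existing datasets mentioned in the abstract do.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class NewsArticle:
    """One news article record.

    Field names are hypothetical; they mirror the features named in the
    abstract, not the released AMINA schema.
    """
    title: str
    body: str                                   # assumed to be present
    label: Optional[str] = None                 # assumed category label
    tags: List[str] = field(default_factory=list)
    published_at: Optional[datetime] = None     # publication date and time
    location: Optional[str] = None
    author: Optional[str] = None
    image_path: Optional[str] = None            # article image
    image_caption: Optional[str] = None
    visits: Optional[int] = None                # number of visits


def to_classifier_text(article: NewsArticle, sep: str = " [SEP] ") -> str:
    """Join title, tags, and body into one string for a text classifier.

    Empty parts (for example, an article without tags) are skipped.
    """
    parts = [article.title, " ".join(article.tags), article.body]
    return sep.join(p for p in parts if p)


# Made-up placeholder values, for illustration only.
example = NewsArticle(title="Sample title", body="Sample body", tags=["sports"])
print(to_classifier_text(example))
```

Keeping the extra features optional reflects that not every source newspaper necessarily publishes all of them; how a real model would combine them with the text is a design choice this sketch does not settle.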
