Similarity Detection of Time-Sensitive Online News Articles Based on RSS Feeds and Contextual Data

Mohammad Daoud
Journal: Annals of Emerging Technologies in Computing (JCR Q2, Computer Science)
Publication date: 2023-01-01
DOI: 10.33166/aetic.2023.01.006 (https://doi.org/10.33166/aetic.2023.01.006)
Citations: 0

Similarity Detection of Time-Sensitive Online News Articles Based on RSS Feeds and Contextual Data
This article addresses the problem of detecting similarity between time-sensitive web news articles. The proposed methodology uses supervised learning algorithms with carefully selected semantic, lexical, and temporal features, covering both content and context. Textual content alone is well studied but can yield misleading results, so the approach also considers the context, community engagement, and community-deduced importance of each news article. The paper details the major procedures: title-pair pre-processing, analysis of lexical units, feature engineering, and similarity measures. Because thousands of web articles are published every second, similarity must be determined efficiently, without spending time on unnecessary text processing of article bodies; the approach therefore focuses on short content (titles) and context. The experiment showed high precision and accuracy on a Really Simple Syndication (RSS) dataset of 8000 Arabic news article pairs collected automatically from 10 different news sources, with an accuracy of 0.81. Contextual features increased both accuracy and precision. The algorithm's output reached a Pearson correlation coefficient of 0.89 with the evaluations of two human judges. The results outperform state-of-the-art systems on Arabic news articles.
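The abstract does not list the exact features or classifier used, so the following is only a minimal sketch of the general idea, under stated assumptions: one lexical feature (token-set Jaccard overlap between two titles), one temporal feature (gap in hours between two publication timestamps, such as an RSS `pubDate`), and Pearson's correlation coefficient for comparing system scores with human ratings. All function names here are hypothetical and are not taken from the paper.

```python
import math
import re
from datetime import datetime


def tokenize(title):
    """Lowercase a title and return its set of word tokens.

    \\w is Unicode-aware in Python 3, so Arabic words are matched too.
    """
    return set(re.findall(r"\w+", title.lower()))


def jaccard(a, b):
    """Lexical overlap |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hours_apart(t1, t2):
    """Temporal feature: absolute gap between two timestamps, in hours."""
    return abs((t1 - t2).total_seconds()) / 3600.0


def pair_features(title1, time1, title2, time2):
    """Illustrative feature vector for one title pair (an assumption,
    not the paper's feature set): [lexical overlap, time gap in hours]."""
    return [
        jaccard(tokenize(title1), tokenize(title2)),
        hours_apart(time1, time2),
    ]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


features = pair_features(
    "Oil prices rise sharply", datetime(2023, 1, 1, 10, 0),
    "Oil prices rise again", datetime(2023, 1, 1, 12, 30),
)
```

In a supervised setup like the one the abstract describes, vectors of this kind would be computed for each labelled title pair and passed to a classifier; contextual signals such as community engagement would be appended as further entries.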
Source journal: Annals of Emerging Technologies in Computing (Computer Science, all)
CiteScore: 3.50; self-citation rate: 0.00%; articles published: 26