BRYT: Automated keyword extraction for open datasets

Umair Ahmed, Charalampos Alexopoulos, Marco Piangerelli, Andrea Polini
Intelligent Systems with Applications, vol. 23, Article 200421. Published 2024-07-24. DOI: 10.1016/j.iswa.2024.200421
https://www.sciencedirect.com/science/article/pii/S2667305324000954

Abstract

In today’s information-driven world, open data makes valuable structured data freely accessible to the public. However, a lack of quality metadata often hinders the findability and representation of this data. This study focuses on keywords and proposes a strategy for generating them automatically. We applied five existing keyword extraction methods (BERT, RAKE, YAKE, TEXTRANK, and ChatGPT) and propose a novel hybrid method named BRYT (read as "bright"). We evaluated these algorithms using Gestalt String Matching and Jaccard Similarity. We validated the study on a selection of datasets from the EU data portal, choosing those with potentially high-quality metadata, that is, datasets that contained a substantial number of keywords and had comprehensive, relevant metadata. The results showed that 69.1% of the dataset keywords were major matches (more than 50% or 5 keywords), 24.7% were minor matches (up to 50% or 5 keywords), and 6.2% did not match. BRYT outperformed the other algorithms in major matches, with ChatGPT a close second. YAKE outperformed the others in minor matches, again with ChatGPT a close second. The evaluation concluded that BRYT consistently extracted more representative keywords in major matches, highlighting its effectiveness in improving findability. This study lays the groundwork for further extraction and population of representative metadata, making data more findable, discoverable, and accessible.
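
The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Gestalt String Matching can be approximated with the standard library's `difflib.SequenceMatcher` (a Ratcliff/Obershelp-style matcher) and that Jaccard Similarity is computed over sets of lowercased keywords. The function names, the 0.8 similarity threshold, and the reading of the major/minor rule ("more than 50% or 5 keywords") are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt-style similarity in [0, 1], compared case-insensitively."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over lowercased keyword sets."""
    set_a = {k.lower() for k in a}
    set_b = {k.lower() for k in b}
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def matched_keywords(extracted: Iterable[str], reference: Iterable[str],
                     threshold: float = 0.8) -> List[str]:
    """Reference keywords that some extracted keyword resembles closely.

    The 0.8 threshold is an assumed value, not one taken from the paper.
    """
    extracted = list(extracted)
    return [ref for ref in reference
            if any(gestalt_similarity(ref, ext) >= threshold for ext in extracted)]


def classify_match(n_matched: int, n_reference: int) -> str:
    """Assumed reading of the abstract's rule: 'major' when more than 50% of
    the reference keywords, or more than 5 of them, are matched; 'none' when
    nothing matches; 'minor' otherwise."""
    if n_matched == 0:
        return "none"
    if n_matched > 5 or n_matched / n_reference > 0.5:
        return "major"
    return "minor"


# Hypothetical keywords, for illustration only.
reference = ["air quality", "pollution", "emissions", "monitoring"]
extracted = ["air-quality", "pollutions", "traffic"]

hits = matched_keywords(extracted, reference)
print(hits, classify_match(len(hits), len(reference)))
print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))
```

Fuzzy matching tolerates small spelling differences such as "air quality" versus "air-quality", whereas the set-based Jaccard score only rewards exact overlap, which is one plausible reason to report both measures.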
