From Web to RheumaLpack: Creating a Linguistic Corpus for Exploitation and Knowledge Discovery in Rheumatology

Alfredo Madrid-García, Beatriz Merino-Barbancho, Dalifer Freites-Núñez, Luis Rodríguez-Rodríguez, Ernestina Menasalvas-Ruíz, Alejandro Rodríguez-González, Anselmo Peñas
medRxiv - Rheumatology, published 2024-04-27. DOI: 10.1101/2024.04.26.24306269 (https://doi.org/10.1101/2024.04.26.24306269)

Abstract

This study introduces RheumaLinguisticpack (RheumaLpack), the first specialised linguistic web corpus designed for the field of musculoskeletal disorders. By combining web mining (i.e., web scraping) and natural language processing (NLP) techniques with clinical expertise, RheumaLpack systematically captures and curates data across a spectrum of web sources, including clinical trials registers (i.e., ClinicalTrials.gov), bibliographic databases (i.e., PubMed), medical agencies (i.e., EMA), social media (i.e., Reddit), and accredited health websites (i.e., MedlinePlus, Harvard Health Publishing, and Cleveland Clinic). Given the complexity of rheumatic and musculoskeletal diseases (RMDs) and their significant impact on quality of life, this resource is proposed as a tool for training algorithms that could help mitigate the diseases' effects. The corpus therefore aims to improve the training of artificial intelligence (AI) algorithms and to facilitate knowledge discovery in RMDs. The development of RheumaLpack followed a systematic six-step methodology covering data identification, characterisation, selection, collection, processing, and corpus description. The result is a non-annotated, monolingual, and dynamic corpus of almost 3 million records spanning 2000 to 2023. RheumaLpack represents a pioneering contribution to rheumatology research, providing a useful resource for the development of advanced AI and NLP applications. The corpus highlights the value of web data in addressing the challenges posed by musculoskeletal diseases and illustrates its potential to improve research and treatment paradigms in rheumatology. Finally, the methodology can be replicated to obtain data from other medical specialities. The code and details on how to build RheumaL(inguistic)pack are also provided to facilitate the dissemination of this resource.
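
As a rough illustration of the selection, processing, and corpus description steps named in the abstract, the following minimal Python sketch runs them over a handful of toy records. It is not the authors' code: the `Record` type, the function names, the deduplication rule, and the sample records are all assumptions made for this example. Only the source names and the 2000 to 2023 span come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout; the real corpus schema may differ."""
    source: str  # e.g. "PubMed", "Reddit" (source names from the abstract)
    year: int
    text: str


def select(records, first_year=2000, last_year=2023):
    """Selection step: keep records inside the corpus time span."""
    return [r for r in records if first_year <= r.year <= last_year]


def process(records):
    """Processing step (assumed rule): normalise whitespace and drop
    empty texts and case-insensitive duplicates within a source."""
    seen = set()
    cleaned = []
    for r in records:
        text = " ".join(r.text.split())
        key = (r.source, text.lower())
        if text and key not in seen:
            seen.add(key)
            cleaned.append(Record(r.source, r.year, text))
    return cleaned


def describe(records):
    """Corpus description step: number of records per source."""
    return Counter(r.source for r in records)


# Toy records invented for illustration; they are not data from the corpus.
raw = [
    Record("PubMed", 2015, "Methotrexate  in rheumatoid arthritis."),
    Record("PubMed", 2015, "methotrexate in rheumatoid arthritis."),
    Record("Reddit", 2021, "Any tips for morning stiffness?"),
    Record("ClinicalTrials.gov", 1998, "An older trial record."),
    Record("MedlinePlus", 2023, "   "),
]

corpus = process(select(raw))
summary = describe(corpus)
print(summary)
```

Keeping each step as a separate function mirrors the step-wise methodology and would make it straightforward to swap in a different source or cleaning rule when adapting the approach to another medical speciality.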