CONCORD: enhancing COVID-19 research with weak-supervision based numerical claim extraction
Authors: Dhwanil Shah, Krish Shah, Manan Jagani, Agam Shah, Bhaskar Chaudhury
Journal: Journal of Intelligent Information Systems (impact factor 2.3; JCR Q3, Computer Science, Artificial Intelligence)
Published: 2024-09-17
DOI: 10.1007/s10844-024-00885-6 (https://doi.org/10.1007/s10844-024-00885-6)
Citations: 0
Abstract
The COVID-19 Numerical Claims Open Research Dataset (CONCORD) is a comprehensive, open-source dataset that extracts numerical claims from academic papers on COVID-19 research. A weak-supervision model is employed for this extraction, taking advantage of its white-box, explainable nature and reduced computational and annotation costs compared to transformer-based models. This model uses labelling functions such as pattern matching, external knowledge bases, phrase matching, and third-party models to generate labels, with an aggregator function handling contradictory labels. Evaluated against established baselines, the model achieved a weighted F1-score of 0.932 and a micro F1-score of 0.930. While transformer-based models achieve comparable results, the explainability of weak-supervision offers distinct advantages. Additionally, generative LLMs were tested to understand their effectiveness in extracting numerical claims, highlighting the impact of prompt engineering on performance. CONCORD contains approximately 200,000 numerical claims from over 57,000 COVID-19 research articles, serving as a valuable resource for tracking developments in COVID-19 research. This dataset, coupled with the weak-supervision approach, provides researchers with a significant tool for advancing COVID-19 research and showcases the potential of these methodologies in the broader biomedical field.
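The labelling-function-plus-aggregator idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual labelling functions or its aggregator, so everything below is an assumption made for illustration: the regular expressions, the cue words, the function names, and the use of a simple majority vote (with ties and all-abstain cases falling back to a default label) are hypothetical stand-ins, not the authors' implementation.

```python
import re
from collections import Counter

# Label values (hypothetical convention): a labelling function may abstain.
CLAIM, NOT_CLAIM, ABSTAIN = 1, 0, -1

# Illustrative patterns only; none of these come from the paper.
UNITS_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:%|mg|days|patients|cases)")
CUES_RE = re.compile(r"\b(?:increased|decreased|reduced|reported)\b", re.IGNORECASE)
CITATION_RE = re.compile(r"\[\d+\]|\(\d{4}\)")
DIGIT_RE = re.compile(r"\d")


def lf_number_with_unit(sentence):
    """Pattern matching: a number followed by a unit suggests a claim."""
    return CLAIM if UNITS_RE.search(sentence) else ABSTAIN


def lf_cue_phrase(sentence):
    """Phrase matching: a digit plus a reporting or change verb suggests a claim."""
    if DIGIT_RE.search(sentence) and CUES_RE.search(sentence):
        return CLAIM
    return ABSTAIN


def lf_no_digits(sentence):
    """A sentence without any digit is treated as not a numerical claim."""
    return ABSTAIN if DIGIT_RE.search(sentence) else NOT_CLAIM


def lf_only_citation_numbers(sentence):
    """If the only digits are citation markers like [12] or (2020), vote NOT_CLAIM."""
    stripped = CITATION_RE.sub("", sentence)
    if DIGIT_RE.search(sentence) and not DIGIT_RE.search(stripped):
        return NOT_CLAIM
    return ABSTAIN


LABELLING_FUNCTIONS = [
    lf_number_with_unit,
    lf_cue_phrase,
    lf_no_digits,
    lf_only_citation_numbers,
]


def aggregate(votes, default=NOT_CLAIM):
    """Majority vote over non-abstaining votes; ties and empty votes give `default`."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return default
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default  # contradictory labels with no majority
    return ranked[0][0]


def label(sentence):
    """Apply every labelling function and aggregate their votes."""
    return aggregate([lf(sentence) for lf in LABELLING_FUNCTIONS])


if __name__ == "__main__":
    for text in [
        "Mortality decreased by 12% in the treated group.",
        "Previous studies have discussed this issue [12].",
        "Vaccines are an important public health tool.",
    ]:
        print(label(text), text)
```

Because each vote can be traced back to a named function, a reader can see which rule fired for a given sentence, which is the kind of white-box behaviour the abstract contrasts with transformer-based models. A learned aggregator that weights the functions by estimated accuracy could replace the majority vote without changing the rest of the sketch.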
About the journal:
The mission of the Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies is to foster and present research and development results focused on the integration of artificial intelligence and database technologies to create next-generation information systems: Intelligent Information Systems.
These new information systems embody knowledge that allows them to exhibit intelligent behavior, cooperate with users and other systems in problem solving, discovery, access, retrieval and manipulation of a wide variety of multimedia data and knowledge, and reason under uncertainty. Increasingly, knowledge-directed inference processes are being used to:
discover knowledge from large data collections,
provide cooperative support to users in complex query formulation and refinement,
access, retrieve, store and manage large collections of multimedia data and knowledge,
integrate information from multiple heterogeneous data and knowledge sources, and
reason about information under uncertain conditions.
Multimedia and hypermedia information systems now operate on a global scale over the Internet, and new tools and techniques are needed to manage these dynamic and evolving information spaces.
The Journal of Intelligent Information Systems provides a forum wherein academics, researchers and practitioners may publish high-quality, original and state-of-the-art papers describing theoretical aspects, systems architectures, analysis and design tools and techniques, and implementation experiences in intelligent information systems. The categories of papers published by JIIS include: research papers, invited papers, meetings, workshop and conference announcements and reports, survey and tutorial articles, and book reviews. Short articles describing open problems or their solutions are also welcome.