Automatic detection and extraction of key resources from tables in biomedical papers.

Biodata Mining (IF 6.1; Zone 3, Biology; JCR Q1, Mathematical & Computational Biology) Pub Date: 2025-03-20 DOI: 10.1186/s13040-025-00438-9

Ibrahim Burak Ozyurt, Anita Bandrowski

Biodata Mining 18(1):23. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924859/pdf/

Citations: 0

Abstract

Background: Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the "findability" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these 'resource table candidates' automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.

Methods: We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification and "Table Transformer" models for table detection and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables, pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data, generated with a novel approach, to alleviate row over-segmentation, significantly improving key resource extraction performance.
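To make "row over-segmentation" concrete: it occurs when a single logical table row is split into several physical rows, typically because a cell wraps across lines in the PDF. The following is a minimal sketch of a row-merging step, not the authors' implementation. The function names, the `is_continuation` hook, and the sample values are all hypothetical, and the simple "empty first cell" rule is only a stand-in for the decision that the paper delegates to its fine-tuned table-specific language model.

```python
from typing import Callable, List

Row = List[str]


def looks_like_continuation(prev: Row, cur: Row) -> bool:
    """Hypothetical stand-in for a learned merge decision.

    Treats `cur` as a wrapped fragment of `prev` when its first cell is
    empty but at least one other cell carries text.
    """
    return bool(cur) and not cur[0].strip() and any(c.strip() for c in cur[1:])


def merge_oversegmented_rows(
    rows: List[Row],
    is_continuation: Callable[[Row, Row], bool] = looks_like_continuation,
) -> List[Row]:
    """Fold continuation rows into the row above them, cell by cell."""
    merged: List[Row] = []
    for row in rows:
        if merged and is_continuation(merged[-1], row):
            prev = merged[-1]
            width = max(len(prev), len(row))
            # Pad both rows to the same width before joining cells.
            prev = prev + [""] * (width - len(prev))
            cur = row + [""] * (width - len(row))
            merged[-1] = [
                " ".join(part for part in (a.strip(), b.strip()) if part)
                for a, b in zip(prev, cur)
            ]
        else:
            merged.append(list(row))
    return merged


# Placeholder values only; these are not real catalog numbers or identifiers.
rows = [
    ["Reagent", "Source", "Identifier"],
    ["Example antibody", "Example vendor", "Cat# 0000;"],
    ["", "", "RRID:AB_0000000"],
    ["Example cell line", "Example vendor", "Cat# 1111"],
]
result = merge_oversegmented_rows(rows)
```

Passing the merge decision in as a callable keeps the merging loop independent of how the decision is made, so a model-based scorer could in principle be swapped in for the heuristic without changing the rest of the sketch.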

Results: The popular GROBID tool extracted key resource tables from PDF files with a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines outperformed GROBID by a large margin. Our best pipeline, which uses a table-specific language model-based row merger, achieved a GriTS score of 0.90.
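For intuition about what a grid similarity score between 0 and 1 measures, here is a deliberately simplified sketch. It is not the official GriTS implementation and will not reproduce the scores reported above: as far as we understand, the published metric aligns both rows and columns of the two tables, whereas this version aligns rows only and pairs cells by column position. The function names and the use of `difflib.SequenceMatcher` for cell text similarity are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List

Table = List[List[str]]


def cell_similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; two empty cells count as identical."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def row_similarity(r1: List[str], r2: List[str]) -> float:
    """Sum of cell similarities, pairing cells by column position."""
    return sum(cell_similarity(a, b) for a, b in zip(r1, r2))


def simplified_grid_similarity(truth: Table, pred: Table) -> float:
    """Row-aligned, F-score-style similarity between two tables.

    Rows are aligned in order with dynamic programming so that extra or
    missing rows are skipped rather than shifting every later row.
    """
    n, m = len(truth), len(pred)
    # best[i][j]: best total similarity aligning truth[:i] with pred[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],      # skip a ground-truth row
                best[i][j - 1],      # skip a predicted row
                best[i - 1][j - 1] + row_similarity(truth[i - 1], pred[j - 1]),
            )
    total_cells = sum(len(r) for r in truth) + sum(len(r) for r in pred)
    return 2.0 * best[n][m] / total_cells if total_cells else 1.0


# Placeholder values only; these are not real identifiers.
truth = [
    ["Reagent", "Identifier"],
    ["Example antibody", "Cat# 0000; RRID:AB_0000000"],
]
oversegmented = [
    ["Reagent", "Identifier"],
    ["Example antibody", "Cat# 0000;"],
    ["", "RRID:AB_0000000"],
]
perfect_score = simplified_grid_similarity(truth, truth)
degraded_score = simplified_grid_similarity(truth, oversegmented)
```

In this toy case the over-segmented prediction is penalized twice: the split cell only partially matches the ground truth, and the extra row adds cells to the denominator without contributing matched content.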

Conclusions: Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on bioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, and annotated training and evaluation data are publicly available.




Source journal

Biodata Mining (Mathematical & Computational Biology)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: not listed

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:

- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.