Locating and parsing bibliographic references in HTML medical articles.

Pub Date: 2010-06-01 · DOI: 10.1007/s10032-009-0105-9

Jie Zou, Daniel Le, George R Thoma

Volume 13(2), pages 107-119 · Publication type: Journal Article

Citations: 28

Abstract

The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even when references do not constitute such a field, locating and parsing them can be a useful preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process that uses statistical machine learning algorithms first to locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low-confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.
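The second parsing approach (per-word classification followed by a rule-driven correction search) can be illustrated with a minimal sketch. This is not the authors' implementation: the label set, the single rule ("each field forms one contiguous chunk"), the `max_flips` search budget, and the confidence scores are all assumptions made for illustration, and the scores stand in for the output of a trained classifier such as an SVM.

```python
from itertools import product

# Hypothetical label set; the abstract does not list the actual labels.
LABELS = ("AUTHOR", "TITLE", "JOURNAL", "DATE")


def chunks(labels):
    """Collapse a word-level label sequence into consecutive same-label chunks."""
    out = []
    for lab in labels:
        if not out or out[-1] != lab:
            out.append(lab)
    return out


def violates_rules(labels):
    """Assumed rule: each field is one contiguous chunk (no label reappears)."""
    seen = chunks(labels)
    return len(seen) != len(set(seen))


def correct_labels(scores, max_flips=3):
    """scores: one dict per word, mapping every label to a confidence value.

    Start from the per-word argmax. If the sequence breaks the rule, re-label
    the k least confident words (k = 1, 2, ...) and return the rule-satisfying
    sequence with the highest total confidence.
    """
    best = [max(s, key=s.get) for s in scores]
    if not violates_rules(best):
        return best
    # Word positions ordered by confidence in their current label, lowest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i][best[i]])
    for k in range(1, max_flips + 1):
        positions = order[:k]
        candidates = []
        for combo in product(LABELS, repeat=k):
            trial = list(best)
            for pos, lab in zip(positions, combo):
                trial[pos] = lab
            if not violates_rules(trial):
                total = sum(scores[i][trial[i]] for i in range(len(trial)))
                candidates.append((total, trial))
        if candidates:
            return max(candidates, key=lambda c: c[0])[1]
    return best  # no valid correction found within the search budget


def word_scores(**conf):
    """Build a full score dict; labels not mentioned get confidence 0.0."""
    return {lab: conf.get(lab, 0.0) for lab in LABELS}


# Made-up confidences for the words: Zou / J / Locating / and / parsing / IJDAR / 2010
example_scores = [
    word_scores(AUTHOR=0.9),
    word_scores(AUTHOR=0.8),
    word_scores(TITLE=0.9),
    word_scores(AUTHOR=0.4, TITLE=0.35),  # weak, wrong argmax inside the title
    word_scores(TITLE=0.85),
    word_scores(JOURNAL=0.9),
    word_scores(DATE=0.95),
]
corrected = correct_labels(example_scores)
print(corrected)
```

In this toy input the fourth word is initially labeled `AUTHOR`, which splits the title into two chunks and breaks the assumed rule; the search flips only that low-confidence word. A real system would use a richer rule set and a more efficient search than exhaustive enumeration.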
