Bigram Label Regularization to Reduce Over-Segmentation on Inline Math Expression Detection

2019 International Conference on Document Analysis and Recognition (ICDAR)
Pub Date: 2019-09-01
DOI: 10.1109/ICDAR.2019.00069

Xing Wang, Zelun Wang, Jyh-Charn S. Liu

Citations: 6

Abstract

An inline mathematical expression is a math expression (ME) that is blended into plaintext sentences in scientific papers. Detecting inline MEs is a non-trivial problem because scientific publications use font styles without restriction and the boundaries between MEs and plaintext are blurred. For instance, existing algorithms incorrectly split many inline MEs into multiple parts because a few characters are misidentified. In this paper, we propose a bigram regularization model to resolve the split problem in inline ME detection. The model incorporates neighboring constraints when labeling ME vs. plaintext. Experimental results show that this technique significantly reduces the splits of inline MEs, with small gains in the false and miss rates. Compared with a CRF model, our model achieves a higher F1 score and a lower miss rate.
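The idea of a neighboring constraint during labeling can be illustrated with a minimal sketch. This is not the authors' model, whose details are not given in the abstract. It assumes a hypothetical per-token classifier that outputs a probability of ME for each token, and it adds a hypothetical fixed cost (`switch_penalty`) whenever two adjacent tokens receive different labels. The cheapest label sequence is then found with standard Viterbi decoding. The function name, the penalty value, and the example probabilities are all assumptions made for illustration.

```python
import math


def viterbi_bigram(p_me, switch_penalty=2.0):
    """Label each token 0 (plaintext) or 1 (ME).

    p_me: per-token ME probabilities from some unigram classifier (assumed).
    switch_penalty: hypothetical cost added when adjacent labels differ.
    """
    if not p_me:
        return []
    eps = 1e-9
    # Unary cost of each label = negative log-likelihood under the classifier.
    unary = [(-math.log(max(1.0 - p, eps)), -math.log(max(p, eps))) for p in p_me]

    cost = [list(unary[0])]  # cost[i][label]: best total cost ending at token i
    back = []                # back[i-1][label]: best previous label for token i
    for i in range(1, len(unary)):
        row, ptr = [], []
        for cur in (0, 1):
            # Pairwise (bigram) term: pay the penalty only on a label change.
            cands = [
                cost[-1][prev] + (switch_penalty if prev != cur else 0.0)
                for prev in (0, 1)
            ]
            best = 0 if cands[0] <= cands[1] else 1
            row.append(cands[best] + unary[i][cur])
            ptr.append(best)
        cost.append(row)
        back.append(ptr)

    # Backtrack from the cheaper final label.
    label = 0 if cost[-1][0] <= cost[-1][1] else 1
    labels = [label]
    for ptr in reversed(back):
        label = ptr[label]
        labels.append(label)
    labels.reverse()
    return labels


# Made-up probabilities: one weakly scored token (0.4) sits inside an ME.
probs = [0.05, 0.9, 0.8, 0.4, 0.9, 0.85, 0.05]
independent = viterbi_bigram(probs, switch_penalty=0.0)
regularized = viterbi_bigram(probs, switch_penalty=2.0)
```

With the penalty set to zero, each token is labeled independently and the weak token splits the expression into two parts. With the penalty enabled, the two extra label changes cost more than relabeling the weak token, so the expression is kept as a single span. A penalty that is too large would also merge genuinely separate expressions, which is the trade-off such a regularizer has to balance.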

