Bigram Label Regularization to Reduce Over-Segmentation on Inline Math Expression Detection
Xing Wang, Zelun Wang, Jyh-Charn S. Liu
2019 International Conference on Document Analysis and Recognition (ICDAR), 2019-09-01
DOI: https://doi.org/10.1109/ICDAR.2019.00069
Citations: 6
Abstract
An inline mathematical expression (ME) is a math expression that is blended into plaintext sentences in a scientific paper. Detecting inline MEs is non-trivial because scientific publications use font styles without restriction and the boundaries between MEs and plaintext are blurred. For instance, many inline MEs detected by existing algorithms are incorrectly split into multiple parts because a few characters are misidentified. In this paper, we propose a bigram regularization model to resolve this split problem in inline ME detection. The model incorporates constraints between neighboring labels when labeling content as ME or plaintext. Experimental results show that this technique significantly reduces the splits of inline MEs, with small gains in the false and miss rates. Compared with a CRF model, our model achieves a higher F1 score and a lower miss rate.
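To illustrate how a constraint between neighboring labels can suppress splits, the following is a minimal sketch and not the authors' model. It assumes each token already has a hypothetical per-token probability of being ME (`p_me`), and it adds a fixed cost (`switch_penalty`, also an assumption) whenever two adjacent labels differ. The lowest-cost label sequence is found with standard Viterbi decoding over the two labels.

```python
import math


def viterbi_binary(p_me, switch_penalty):
    """Label each token 0 (plaintext) or 1 (ME).

    Minimizes the sum of unary costs -log p(label) plus `switch_penalty`
    for every pair of adjacent tokens whose labels differ.
    Both inputs are illustrative assumptions, not values from the paper.
    """
    eps = 1e-9  # guards against log(0)
    unary = [(-math.log(max(1.0 - p, eps)), -math.log(max(p, eps))) for p in p_me]
    if not unary:
        return []

    cost = list(unary[0])  # best cost of a path ending in label 0 / label 1
    back = []              # back[i - 1][cur] = best previous label for token i
    for i in range(1, len(unary)):
        new_cost, ptr = [0.0, 0.0], [0, 0]
        for cur in (0, 1):
            stay = cost[cur]
            switch = cost[1 - cur] + switch_penalty
            if stay <= switch:
                new_cost[cur], ptr[cur] = stay + unary[i][cur], cur
            else:
                new_cost[cur], ptr[cur] = switch + unary[i][cur], 1 - cur
        cost = new_cost
        back.append(ptr)

    # Trace the best path backwards from the cheaper final label.
    label = 0 if cost[0] <= cost[1] else 1
    labels = [label]
    for ptr in reversed(back):
        label = ptr[label]
        labels.append(label)
    labels.reverse()
    return labels


def count_me_segments(labels):
    """Count maximal runs of consecutive ME labels (1s)."""
    return sum(
        1
        for i, lab in enumerate(labels)
        if lab == 1 and (i == 0 or labels[i - 1] == 0)
    )


# Made-up scores: the fifth token is a weakly scored character inside an ME.
p_me = [0.1, 0.1, 0.9, 0.8, 0.4, 0.9, 0.85, 0.1, 0.1]
independent = viterbi_binary(p_me, switch_penalty=0.0)
regularized = viterbi_binary(p_me, switch_penalty=1.0)
```

With these made-up scores, independent labeling (penalty 0) splits the expression into two ME segments at the weakly scored token, while a penalty of 1.0 makes the two extra label switches more expensive than relabeling that token, so the expression is recovered as a single segment. A penalty that is too large would instead merge genuinely separate expressions or absorb plaintext, which is the kind of trade-off a false rate and a miss rate would measure.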