Xing Wang, Jason Lin, Ryan Vrecenar, Jyh-Charn S. Liu
Mathematical Expressions (MEs) and words are carefully bonded in technical writing to characterize physical concepts and their interactions quantitatively and qualitatively. This paper proposes the Qualitative-Quantitative (QuQn) map as an abstraction of scientific papers that depicts the dependencies among MEs and their most related adjacent words. The QuQn map aims to offer a succinct representation of the reasoning flow in a paper. Various filters can be applied to a QuQn map to reduce redundant or indirect links, control the display of problem settings (simple ME variables with their declarations), and prune nodes by topological properties such as membership in the largest connected subgraph. We developed a prototype visualization tool to support interactive browsing of the technical content at different granularities of detail.
{"title":"QuQn map: Qualitative-Quantitative mapping of scientific papers","authors":"Xing Wang, Jason Lin, Ryan Vrecenar, Jyh-Charn S. Liu","doi":"10.1145/3209280.3229116","DOIUrl":"https://doi.org/10.1145/3209280.3229116","url":null,"abstract":"Mathematical Expressions (ME) and words are carefully bonded in technical writing to characterize physical concepts and their interactions quantitatively, and qualitatively. This paper proposes the Qualitative-Quantitative (QuQn) map as an abstraction of scientific papers to depict the dependency among MEs and their most related adjacent words. QuQn map aims to offer a succinct representation of the reasoning logic flow in a paper. Various filters can be applied to a QuQn map to reduce redundant/indirect links, control the display of problem settings (simple ME variables with declaration), and prune nodes with specific topological properties such as the largest connected subgraph. We developed a visualization tool prototype to support interactive browsing of the technical contents at different granularities of detail.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122425815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sitong Chen, A. Mohammad, Seyednaser Nourashrafeddin, E. Milios
In this paper, we propose a high-recall active document retrieval system for a class of applications involving query documents, as opposed to key terms, and domain-specific document corpora. The output of the model is a list of documents retrieved based on domain-expert feedback collected during training. A modified version of the Bag-of-Words (BoW) representation and a semantic ranking module based on Google n-grams are used in the model. The core of the system is a binary document classification model trained through a continuous active learning strategy. In general, finding or constructing training data for this type of problem is very difficult, due either to the confidentiality of the data or to the domain-expert time needed to label it. Our experimental results on retrieving Calls For Papers based on a manuscript demonstrate the efficacy of the system for this application and its performance relative to other candidate models.
{"title":"Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents","authors":"Sitong Chen, A. Mohammad, Seyednaser Nourashrafeddin, E. Milios","doi":"10.1145/3209280.3209532","DOIUrl":"https://doi.org/10.1145/3209280.3209532","url":null,"abstract":"In this paper, we propose a high recall active document retrieval system for a class of applications involving query documents, as opposed to key terms, and domain-specific document corpora. The output of the model is a list of documents retrieved based on the domain expert feedback collected during training. A modified version of Bag of Word (BoW) representation and a semantic ranking module, based on Google n-grams, are used in the model. The core of the system is a binary document classification model which is trained through a continuous active learning strategy. In general, finding or constructing training data for this type of problem is very difficult due to either confidentiality of the data, or the need for domain expert time to label data. Our experimental results on the retrieval of Call For Papers based on a manuscript demonstrate the efficacy of the system to address this application and its performance compared to other candidate models.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130672364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Moreno, Luiz José Schirmer Silva, M. D. Bayser, R. Brandão, Renato F. G. Cerqueira
Finding concepts in a document corpus while accounting for their meaning and semantic relations is an important and challenging task. In this paper, we present our contributions toward understanding unstructured data present in one or more documents. The current literature generally concentrates on structuring knowledge by identifying semantic entities in the data. In this paper, we test our hypothesis that hyperknowledge specifications are capable of defining rich relations among documents and extracted facts. The main evidence supporting this hypothesis is that hyperknowledge was built on top of hypermedia fundamentals, easing the specification of rich relationships between different multimodal components (i.e., multimedia content and knowledge entities). The key challenge tackled in this paper is how to structure and correlate these components considering their meaning and semantic relations.
{"title":"Understanding Documents with Hyperknowledge Specifications","authors":"M. Moreno, Luiz José Schirmer Silva, M. D. Bayser, R. Brandão, Renato F. G. Cerqueira","doi":"10.1145/3209280.3229118","DOIUrl":"https://doi.org/10.1145/3209280.3229118","url":null,"abstract":"Finding concepts considering their meaning and semantic relations in a document corpus is an important and challenging task. In this paper, we present our contributions on how to understand unstructured data present in one or multiple documents. Generally, the current literature concentrates efforts in structuring knowledge by identifying semantic entities in the data. In this paper, we test our hypothesis that hyperknowledge specifications are capable of defining rich relations among documents and extracted facts. The main evidence supporting this hypothesis is the fact that hyperknowledge was built on top of hypermedia fundamentals, easing the specification of rich relationships between different multimodal components (i.e. multimedia content and knowledge entities). The key challenge tackled in this paper is how to structure and correlate these components considering their meaning and semantic relations.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123414145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this tutorial, we consider important aspects (algorithms, approaches, considerations) for tagging both unstructured and structured text for downstream use. This includes summarization, in which text information is compressed for more efficient archiving, searching, and clustering. In the tutorial, we focus on the topic of automatic text summarization, covering the most important milestones of the six decades of research in this area.
{"title":"Automatic Text Summarization and Classification","authors":"S. Simske, R. Lins","doi":"10.1145/3209280.3232791","DOIUrl":"https://doi.org/10.1145/3209280.3232791","url":null,"abstract":"In this tutorial, we consider important aspects (algorithms, approaches, considerations) for tagging both unstructured and structured text for downstream use. This includes summarization, in which text information is compressed for more efficient archiving, searching, and clustering. In the tutorial, we focus on the topic of automatic text summarization, covering the most important milestones of the six decades of research in this area.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127855608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric M. Domke, J. Leidig, Gregory Schymik, G. Wolffe
Although web search remains an active research area, interest in enterprise search has not kept pace with the information requirements of the contemporary workforce. To address this gap, this research aims to develop, implement, and study the query expansion techniques most effective at improving relevancy in enterprise search. The case-study instrument was a custom Apache Solr-based search application deployed at a medium-sized manufacturing company. It was hypothesized that a composition of techniques tailored to enterprise content and information needs would prove effective in increasing relevancy evaluation scores. Query expansion techniques leveraging entity recognition, alphanumeric term identification, and intent classification were implemented and studied using real enterprise content and query logs. They were evaluated against a set of test queries derived from relevance survey results, using standard relevancy metrics such as normalized discounted cumulative gain (nDCG). Each of these modules produced meaningful and statistically significant improvements in relevancy.
{"title":"Query Expansion in Enterprise Search","authors":"Eric M. Domke, J. Leidig, Gregory Schymik, G. Wolffe","doi":"10.1145/3209280.3229111","DOIUrl":"https://doi.org/10.1145/3209280.3229111","url":null,"abstract":"Although web search remains an active research area, interest in enterprise search has not kept up with the information requirements of the contemporary workforce. To address these issues, this research aims to develop, implement, and study the query expansion techniques most effective at improving relevancy in enterprise search. The case-study instrument was a custom Apache Solr-based search application deployed at a medium-sized manufacturing company. It was hypothesized that a composition of techniques tailored to enterprise content and information needs would prove effective in increasing relevancy evaluation scores. Query expansion techniques leveraging entity recognition, alphanumeric term identification, and intent classification were implemented and studied using real enterprise content and query logs. They were evaluated against a set of test queries derived from relevance survey results using standard relevancy metrics such as normalized discounted cumulative gain (nDCG). Each of these modules produced meaningful and statistically significant improvements in relevancy.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127294454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cannannore Nidhi Narayana Kamath, S. S. Bukhari, A. Dengel
Any organization working with documents eventually faces the tedious task of classifying large volumes of them, which is the first step toward information retrieval and data mining. Classifying such large document collections into multiple classes demands considerable time and labor, so a system that can classify them with acceptable accuracy would be of great help in document engineering. We created multiple classifiers for document classification and compared their accuracy on raw and processed data. We gathered data used in a corporate organization as well as publicly available data for comparison. The data is processed by removing stop-words, and stemming is applied to produce root words. Several traditional machine learning techniques, namely Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest, and Multi-Layer Perceptron, are used to classify the documents. The classifiers are applied to raw and processed data separately and their accuracy is recorded. In addition, a deep learning technique, the Convolutional Neural Network, is used to classify the data, and its accuracy is compared with that of the traditional machine learning techniques. We also explore hierarchical classifiers for classifying classes and subclasses. The system classifies the data faster and with better accuracy than manual classification. The results are discussed in the results and evaluation section.
{"title":"Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification","authors":"Cannannore Nidhi Narayana Kamath, S. S. Bukhari, A. Dengel","doi":"10.1145/3209280.3209526","DOIUrl":"https://doi.org/10.1145/3209280.3209526","url":null,"abstract":"In this contemporaneous world, it is an obligation for any organization working with documents to end up with the insipid task of classifying truckload of documents, which is the nascent stage of venturing into the realm of information retrieval and data mining. But classification of such humongous documents into multiple classes, calls for a lot of time and labor. Hence a system which could classify these documents with acceptable accuracy would be of an unfathomable help in document engineering. We have created multiple classifiers for document classification and compared their accuracy on raw and processed data. We have garnered data used in a corporate organization as well as publicly available data for comparison. Data is processed by removing the stop-words and stemming is implemented to produce root words. Multiple traditional machine learning techniques like Naive Bayes, Logistic Regression, Support Vector Machine, Random forest Classifier and Multi-Layer Perceptron are used for classification of documents. Classifiers are applied on raw and processed data separately and their accuracy is noted. Along with this, Deep learning technique such as Convolution Neural Network is also used to classify the data and its accuracy is compared with that of traditional machine learning techniques. We are also exploring hierarchical classifiers for classification of classes and subclasses. The system classifies the data faster and with better accuracy than if done manually. The results are discussed in the results and evaluation section.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130055272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Each day, a vast amount of data is published on the web. In addition, the rate at which content is being published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various domains of expertise face challenges when trying to retrieve the data they require. They may rely on IT experts, but these experts have limited knowledge of individual domains, making data extraction a time-consuming and error-prone task. It would be beneficial if domain experts were able to retrieve the data they need and create relatively complex queries on top of web documents. Existing query solutions are either limited to a specific domain or require a predefined knowledge base or sample ontologies to begin with. To address these limitations, we propose a goal-oriented platform that enables users to easily extract data from web documents. This platform enables users to express their goals in natural language, after which the platform elicits the corresponding result type using the proposed algorithm. The platform also applies the concept of ontology to semantically improve search results. To retrieve the most relevant results from web documents, the segments of a user's query are mapped to the entities of the ontology. Two types of ontologies are used: goal ontologies and domain-specific ones, which comprise domain concepts and the relationships among them. In addition, the platform helps domain experts generate the domain ontologies that will be used to extract data from web documents. Placing ontologies at the center of the approach integrates a level of semantics into the platform, resulting in more precise output. The main contributions of this research are a goal-oriented platform for extracting data from web documents and the integration of ontology-based development into web-document searches.
{"title":"GOWDA","authors":"Bahareh Zarei, M. Gaedke","doi":"10.1145/3209280.3229099","DOIUrl":"https://doi.org/10.1145/3209280.3229099","url":null,"abstract":"Each day, a vast amount of data is published on the web. In addition, the rate at which content is being published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various domains of expertise face challenges when trying to retrieve the data they require. They may rely on IT experts, but these experts have limited knowledge of individual domains, making data extraction a time-consuming and error-prone task. It would be beneficial if domain experts were able to retrieve needed data and create relatively complex queries on top of web documents. The existing query solutions either are limited to a specific domain or require beginning with a predefined knowledge base or sample ontologies. To address these limitations, we propose a goal-oriented platform that enables users to easily extract data from web documents. This platform enables users to express their goals in natural language, after which the platform elicits the corresponding result type using the algorithm proposed. The platform also applies the concept of ontology to semantically improve search results. To retrieve the most relevant results from web documents, the segments of a user's query are mapped to the entities of the ontology. Two types of ontologies are used: goal ontologies and domain-specific ones, which comprise domain concepts and the relationships among them. In addition, the platform helps domain experts to generate the domain ontologies that will be used to extract data from web documents. Placing ontologies at the center of the approach integrates a level of semantics into the platform, resulting in more-precise output. The main contributions of this research are that it provides a goal-oriented platform for extracting data from web documents and integrates ontology-based development into web-document searches.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117322745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The objective of high-recall information retrieval (HRIR) is to identify substantially all information relevant to an information need, where the consequences of missing or untimely results may have serious legal, policy, health, social, safety, defence, or financial implications. To find acceptance in practice, HRIR technologies must be more effective---and must be shown to be more effective---than current practice, according to the legal, statutory, regulatory, ethical, or professional standards governing the application domain. Such domains include, but are not limited to, electronic discovery in legal proceedings; distinguishing between public and non-public records in the curation of government archives; systematic review for meta-analysis in evidence-based medicine; separating irregularities and intentional misstatements from unintentional errors in accounting restatements; performing "due diligence" in connection with pending mergers, acquisitions, and financing transactions; and surveillance and compliance activities involving massive datasets. HRIR differs from ad hoc information retrieval where the objective is to identify the best, rather than all relevant information, and from classification or categorization where the objective is to separate relevant from non-relevant information based on previously labeled training examples. HRIR is further differentiated from established information retrieval applications by the need to quantify "substantially all relevant information"; an objective for which existing evaluation strategies and measures, such as precision and recall, are not particularly well suited.
{"title":"The Quest for Total Recall","authors":"G. Cormack, Maura R. Grossman","doi":"10.1145/3209280.3232788","DOIUrl":"https://doi.org/10.1145/3209280.3232788","url":null,"abstract":"The objective of high-recall information retrieval (HRIR) is to identify substantially all information relevant to an information need, where the consequences of missing or untimely results may have serious legal, policy, health, social, safety, defence, or financial implications. To find acceptance in practice, HRIR technologies must be more effective---and must be shown to be more effective---than current practice, according to the legal, statutory, regulatory, ethical, or professional standards governing the application domain. Such domains include, but are not limited to, electronic discovery in legal proceedings; distinguishing between public and non-public records in the curation of government archives; systematic review for meta-analysis in evidence-based medicine; separating irregularities and intentional misstatements from unintentional errors in accounting restatements; performing \"due diligence\" in connection with pending mergers, acquisitions, and financing transactions; and surveillance and compliance activities involving massive datasets. HRIR differs from ad hoc information retrieval where the objective is to identify the best, rather than all relevant information, and from classification or categorization where the objective is to separate relevant from non-relevant information based on previously labeled training examples. HRIR is further differentiated from established information retrieval applications by the need to quantify \"substantially all relevant information\"; an objective for which existing evaluation strategies and measures, such as precision and recall, are not particularly well suited.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115359401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikiforos Pittaras, George Giannakopoulos, Leonidas Tsekouras, Iraklis Varlamis
This work examines document clustering as a record linkage problem, focusing on named entities and frequent terms, using several vector- and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e., preprocessing, scalable feature representation, blocking, and clustering), and the OpenCalais platform is used for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving the scalability of large-scale document clustering tasks.
{"title":"Document clustering as a record linkage problem","authors":"Nikiforos Pittaras, George Giannakopoulos, Leonidas Tsekouras, Iraklis Varlamis","doi":"10.1145/3209280.3229109","DOIUrl":"https://doi.org/10.1145/3209280.3229109","url":null,"abstract":"This work examines document clustering as a record linkage problem, focusing on named-entities and frequent terms, using several vector and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e. preprocessing, scalable feature representation, blocking and clustering) and the OpenCalais platform for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving the scalability for large-scale document clustering tasks.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117278602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents two prediction models for Mathematical Expression Constraints (ME-Con) in technical publications. Under the assumption of independent probability distributions, two types of features are used for analysis: FS, based on the ME symbols, and FW, based on the words adjacent to MEs. The first prediction model is based on an iterative greedy scheme that aims to optimize the performance goal. The second scheme is based on naïve Bayesian inference over the two feature types, considering the likelihood of the training data. The first model achieved an average F1 score of 69.5% (based on tests performed on an Elsevier dataset). The second prediction model achieved an F1 score of 82.4% and an accuracy of 81.8% using FS. It achieved similar yet slightly higher F1 scores than the first model for the word stems of FW, but a slightly lower F1 score for the Part-Of-Speech (POS) tags of FW.
{"title":"Prediction of Mathematical Expression Constraints (ME-Con)","authors":"Jason Lin, Xing Wang, Jyh-Charn S. Liu","doi":"10.1145/3209280.3229106","DOIUrl":"https://doi.org/10.1145/3209280.3229106","url":null,"abstract":"This paper presents two different prediction models of Mathematical Expression Constraints (ME-Con) in technical publications. Based on the assumption of independent probability distributions, two types of features: FS, based on the ME symbols; FW, based on the words adjacent to MEs, are used for analysis. The first prediction model is based on an iterative greedy scheme aiming to optimize the performance goal. The second scheme is based on naïve Bayesian inference of the two different feature types considering the likelihood of the training data. The first model achieved an average F1 scores of 69.5% (based on the tests made on an Elsevier dataset). The second prediction model using FS achieved 82.4% for F1 score and 81.8% accuracy. And it achieved similar yet slightly higher F1 scores as that of the first model for the word stems of FW, but slightly lower F1 score for the Part-Of-Speech (POS) tags of FW.1","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132138350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}