
Proceedings of the ACM Symposium on Document Engineering 2018: Latest Publications

QuQn map: Qualitative-Quantitative mapping of scientific papers
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229116
Xing Wang, Jason Lin, Ryan Vrecenar, Jyh-Charn S. Liu
Mathematical Expressions (MEs) and words are carefully bonded in technical writing to characterize physical concepts and their interactions quantitatively and qualitatively. This paper proposes the Qualitative-Quantitative (QuQn) map as an abstraction of scientific papers that depicts the dependencies among MEs and their most related adjacent words. The QuQn map aims to offer a succinct representation of the reasoning logic flow in a paper. Various filters can be applied to a QuQn map to reduce redundant/indirect links, control the display of problem settings (simple ME variables with declarations), and prune nodes based on specific topological properties, such as keeping only the largest connected subgraph. We developed a visualization tool prototype to support interactive browsing of the technical contents at different granularities of detail.
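As a rough illustration of the graph filters described above, the sketch below builds a toy ME-dependency graph with networkx, removes an indirect link via transitive reduction, and prunes everything outside the largest connected subgraph. The node names and edges are hypothetical stand-ins, not taken from the paper's pipeline.

```python
# A minimal sketch of QuQn-style graph filtering, assuming ME/word
# dependencies were already extracted; node names and edges are hypothetical.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("x", "eq1"),    # declared variable x feeds equation eq1
    ("eq1", "eq2"),  # eq2 builds on eq1
    ("eq2", "eq3"),
    ("x", "eq2"),    # indirect link, already implied by x -> eq1 -> eq2
    ("y", "eq9"),    # an isolated problem-setting fragment
])

# Filter 1: drop redundant/indirect links (transitive reduction of the DAG).
G = nx.transitive_reduction(G)

# Filter 2: keep only the largest (weakly) connected subgraph.
largest = max(nx.weakly_connected_components(G), key=len)
core = G.subgraph(largest).copy()

print(sorted(core.nodes()))          # ['eq1', 'eq2', 'eq3', 'x']
print(("x", "eq2") in core.edges())  # False: the indirect link was removed
```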
Citations: 2
Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209532
Sitong Chen, A. Mohammad, Seyednaser Nourashrafeddin, E. Milios
In this paper, we propose a high-recall active document retrieval system for a class of applications involving query documents (as opposed to key terms) and domain-specific document corpora. The output of the model is a list of documents retrieved based on the domain-expert feedback collected during training. A modified version of the Bag-of-Words (BoW) representation and a semantic ranking module based on Google n-grams are used in the model. The core of the system is a binary document classification model that is trained through a continuous active learning strategy. In general, finding or constructing training data for this type of problem is very difficult, due either to the confidentiality of the data or to the need for domain-expert time to label data. Our experimental results on the retrieval of Call For Papers based on a manuscript demonstrate the efficacy of the system for this application and its performance compared to other candidate models.
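The continuous active learning strategy mentioned above amounts to a relevance-feedback loop: train, rank, ask the expert about the top-ranked unlabeled documents, repeat. The sketch below uses TF-IDF features and logistic regression as stand-ins for the paper's modified BoW representation and classifier; `get_expert_label` is a hypothetical callback representing the domain expert.

```python
# A minimal continuous-active-learning sketch; TF-IDF + logistic regression
# are placeholders, not the paper's exact model. seed_labels must contain
# at least one relevant (1) and one non-relevant (0) document index.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_active_learning(docs, seed_labels, get_expert_label,
                               rounds=10, batch=5):
    X = TfidfVectorizer().fit_transform(docs)
    labeled = dict(seed_labels)  # doc index -> 0/1 relevance judgment
    for _ in range(rounds):
        idx = list(labeled)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[idx], [labeled[i] for i in idx])
        scores = clf.predict_proba(X)[:, 1]
        # ask the expert about the highest-scoring unlabeled documents
        candidates = [j for j in np.argsort(-scores) if j not in labeled]
        for i in candidates[:batch]:
            labeled[i] = get_expert_label(docs[i])
    # return the documents judged relevant, best-scoring first
    relevant = [i for i, y in labeled.items() if y == 1]
    return sorted(relevant, key=lambda i: -scores[i])
```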
Citations: 2
Understanding Documents with Hyperknowledge Specifications
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229118
M. Moreno, Luiz José Schirmer Silva, M. D. Bayser, R. Brandão, Renato F. G. Cerqueira
Finding concepts in a document corpus while considering their meaning and semantic relations is an important and challenging task. In this paper, we present our contributions on how to understand unstructured data present in one or multiple documents. Generally, the current literature concentrates its efforts on structuring knowledge by identifying semantic entities in the data. Here, we test our hypothesis that hyperknowledge specifications are capable of defining rich relations among documents and extracted facts. The main evidence supporting this hypothesis is the fact that hyperknowledge was built on top of hypermedia fundamentals, easing the specification of rich relationships between different multimodal components (i.e., multimedia content and knowledge entities). The key challenge tackled in this paper is how to structure and correlate these components considering their meaning and semantic relations.
Citations: 1
Automatic Text Summarization and Classification
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3232791
S. Simske, R. Lins
In this tutorial, we consider important aspects (algorithms, approaches, considerations) for tagging both unstructured and structured text for downstream use. This includes summarization, in which text information is compressed for more efficient archiving, searching, and clustering. In the tutorial, we focus on the topic of automatic text summarization, covering the most important milestones of the six decades of research in this area.
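As a concrete anchor for the tutorial's starting point, here is a minimal sketch of frequency-based extractive summarization in the spirit of the field's earliest milestone (Luhn's 1958 work); the tiny stop-word list and regex sentence splitter are simplifying assumptions, not part of the tutorial itself.

```python
# A minimal frequency-based extractive summarizer sketch; the stop-word
# list and sentence splitting are illustrative simplifications.
import re
from collections import Counter

STOP = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for", "on"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(tokens)

    def score(sentence):
        # a sentence's score is the total corpus frequency of its content words
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower())
                   if w not in STOP)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```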
Citations: 2
Query Expansion in Enterprise Search
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229111
Eric M. Domke, J. Leidig, Gregory Schymik, G. Wolffe
Although web search remains an active research area, interest in enterprise search has not kept up with the information requirements of the contemporary workforce. To address these issues, this research aims to develop, implement, and study the query expansion techniques most effective at improving relevancy in enterprise search. The case-study instrument was a custom Apache Solr-based search application deployed at a medium-sized manufacturing company. It was hypothesized that a composition of techniques tailored to enterprise content and information needs would prove effective in increasing relevancy evaluation scores. Query expansion techniques leveraging entity recognition, alphanumeric term identification, and intent classification were implemented and studied using real enterprise content and query logs. They were evaluated against a set of test queries derived from relevance survey results using standard relevancy metrics such as normalized discounted cumulative gain (nDCG). Each of these modules produced meaningful and statistically significant improvements in relevancy.
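Normalized discounted cumulative gain, the headline metric above, rewards placing highly relevant results early in the ranking. A minimal sketch of its standard formulation follows; the graded relevance judgments in the example are hypothetical.

```python
# A minimal nDCG sketch using the traditional formulation
# DCG = sum_i rel_i / log2(i + 1), with ranks starting at 1.
import math

def dcg(relevances):
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances, k=None):
    r = ranked_relevances[:k] if k else list(ranked_relevances)
    # the ideal ordering places the highest grades first
    ideal = sorted(ranked_relevances, reverse=True)[:len(r)]
    return dcg(r) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# hypothetical relevance grades (0-3) of the top five results for one query
print(round(ndcg([3, 2, 3, 0, 1], k=5), 2))  # 0.97
```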
Citations: 0
Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209526
Cannannore Nidhi Narayana Kamath, S. S. Bukhari, A. Dengel
In today's world, any organization working with documents ends up with the tedious task of classifying truckloads of documents, which is the nascent stage of venturing into the realm of information retrieval and data mining. Classifying such a huge volume of documents into multiple classes calls for a great deal of time and labor, so a system that could classify these documents with acceptable accuracy would be of enormous help in document engineering. We have created multiple classifiers for document classification and compared their accuracy on raw and processed data. We have gathered data used in a corporate organization as well as publicly available data for comparison. Data is processed by removing stop words, and stemming is applied to produce root words. Multiple traditional machine learning techniques, such as Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest, and Multi-Layer Perceptron classifiers, are used for the classification of documents. The classifiers are applied to raw and processed data separately, and their accuracy is noted. Along with this, a deep learning technique, the Convolutional Neural Network, is also used to classify the data, and its accuracy is compared with that of the traditional machine learning techniques. We are also exploring hierarchical classifiers for the classification of classes and subclasses. The system classifies the data faster and with better accuracy than manual classification. The results are discussed in the results and evaluation section.
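The traditional-ML side of such a comparison is straightforward to reproduce with scikit-learn. The sketch below runs several of the classifiers named above over TF-IDF features on a tiny hypothetical corpus standing in for the real data; stemming is omitted here and would be an extra preprocessing step (e.g., with NLTK's PorterStemmer).

```python
# A minimal sketch comparing traditional classifiers on a toy corpus;
# the documents and labels are hypothetical stand-ins for the real data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["invoice for the march order", "quarterly revenue report",
        "server outage postmortem", "incident log for the database crash"] * 25
labels = ["finance", "finance", "ops", "ops"] * 25

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("Linear SVM", LinearSVC()),
                  ("Random Forest", RandomForestClassifier())]:
    # stop-word removal is handled inside the vectorizer
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    acc = cross_val_score(pipe, docs, labels, cv=5).mean()
    print(f"{name}: mean accuracy {acc:.2f}")
```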
Citations: 58
GOWDA
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229099
Bahareh Zarei, M. Gaedke
Each day, a vast amount of data is published on the web. In addition, the rate at which content is published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various domains of expertise face challenges when trying to retrieve the data they require. They may rely on IT experts, but these experts have limited knowledge of individual domains, making data extraction a time-consuming and error-prone task. It would be beneficial if domain experts were able to retrieve needed data and create relatively complex queries on top of web documents. Existing query solutions are either limited to a specific domain or require beginning with a predefined knowledge base or sample ontologies. To address these limitations, we propose a goal-oriented platform that enables users to easily extract data from web documents. This platform enables users to express their goals in natural language, after which the platform elicits the corresponding result type using the proposed algorithm. The platform also applies the concept of ontology to semantically improve search results. To retrieve the most relevant results from web documents, the segments of a user's query are mapped to the entities of the ontology. Two types of ontologies are used: goal ontologies and domain-specific ones, which comprise domain concepts and the relationships among them. In addition, the platform helps domain experts to generate the domain ontologies that will be used to extract data from web documents. Placing ontologies at the center of the approach integrates a level of semantics into the platform, resulting in more precise output. The main contributions of this research are that it provides a goal-oriented platform for extracting data from web documents and integrates ontology-based development into web-document searches.
{"title":"GOWDA","authors":"Bahareh Zarei, M. Gaedke","doi":"10.1145/3209280.3229099","DOIUrl":"https://doi.org/10.1145/3209280.3229099","url":null,"abstract":"Each day, a vast amount of data is published on the web. In addition, the rate at which content is being published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various domains of expertise face challenges when trying to retrieve the data they require. They may rely on IT experts, but these experts have limited knowledge of individual domains, making data extraction a time-consuming and error-prone task. It would be beneficial if domain experts were able to retrieve needed data and create relatively complex queries on top of web documents. The existing query solutions either are limited to a specific domain or require beginning with a predefined knowledge base or sample ontologies. To address these limitations, we propose a goal-oriented platform that enables users to easily extract data from web documents. This platform enables users to express their goals in natural language, after which the platform elicits the corresponding result type using the algorithm proposed. The platform also applies the concept of ontology to semantically improve search results. To retrieve the most relevant results from web documents, the segments of a user's query are mapped to the entities of the ontology. Two types of ontologies are used: goal ontologies and domain-specific ones, which comprise domain concepts and the relationships among them. In addition, the platform helps domain experts to generate the domain ontologies that will be used to extract data from web documents. Placing ontologies at the center of the approach integrates a level of semantics into the platform, resulting in more-precise output. The main contributions of this research are that it provides a goal-oriented platform for extracting data from web documents and integrates ontology-based development into web-document searches.","PeriodicalId":234145,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2018","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117322745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
The Quest for Total Recall
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3232788
G. Cormack, Maura R. Grossman
The objective of high-recall information retrieval (HRIR) is to identify substantially all information relevant to an information need, where the consequences of missing or untimely results may have serious legal, policy, health, social, safety, defence, or financial implications. To find acceptance in practice, HRIR technologies must be more effective---and must be shown to be more effective---than current practice, according to the legal, statutory, regulatory, ethical, or professional standards governing the application domain. Such domains include, but are not limited to, electronic discovery in legal proceedings; distinguishing between public and non-public records in the curation of government archives; systematic review for meta-analysis in evidence-based medicine; separating irregularities and intentional misstatements from unintentional errors in accounting restatements; performing "due diligence" in connection with pending mergers, acquisitions, and financing transactions; and surveillance and compliance activities involving massive datasets. HRIR differs from ad hoc information retrieval where the objective is to identify the best, rather than all relevant information, and from classification or categorization where the objective is to separate relevant from non-relevant information based on previously labeled training examples. HRIR is further differentiated from established information retrieval applications by the need to quantify "substantially all relevant information"; an objective for which existing evaluation strategies and measures, such as precision and recall, are not particularly well suited.
Citations: 1
Document clustering as a record linkage problem
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229109
Nikiforos Pittaras, George Giannakopoulos, Leonidas Tsekouras, Iraklis Varlamis
This work examines document clustering as a record linkage problem, focusing on named entities and frequent terms, using several vector- and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e., preprocessing, scalable feature representation, blocking, and clustering), and the OpenCalais platform for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving scalability in large-scale document clustering tasks.
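As a concrete reference point for the clustering step, the sketch below runs k-means over L2-normalized TF-IDF vectors, which makes Euclidean k-means behave like clustering by cosine similarity, and reports one common clustering-quality metric. It is only a stand-in pipeline, not the JedAI/OpenCalais setup used in the paper, and the example documents are hypothetical.

```python
# A minimal k-means document-clustering sketch with a cosine-style
# similarity; not the paper's JedAI/OpenCalais pipeline.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

docs = ["the court issued a ruling in the merger case",
        "the judge delayed the merger ruling",
        "the striker scored twice in the cup final",
        "the goalkeeper saved a penalty in the final"]

# L2-normalizing TF-IDF vectors makes Euclidean k-means approximate
# spherical (cosine-similarity) k-means.
X = normalize(TfidfVectorizer(stop_words="english").fit_transform(docs))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)                       # cluster assignment per document
print(silhouette_score(X, km.labels_))  # one clustering-quality metric
```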
Citations: 4
Prediction of Mathematical Expression Constraints (ME-Con)
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229106
Jason Lin, Xing Wang, Jyh-Charn S. Liu
This paper presents two different prediction models for Mathematical Expression Constraints (ME-Con) in technical publications. Based on the assumption of independent probability distributions, two types of features are used for analysis: FS, based on the ME symbols, and FW, based on the words adjacent to MEs. The first prediction model is based on an iterative greedy scheme aiming to optimize the performance goal. The second scheme is based on naïve Bayesian inference over the two different feature types, considering the likelihood of the training data. The first model achieved an average F1 score of 69.5% (based on tests made on an Elsevier dataset). The second prediction model using FS achieved an 82.4% F1 score and 81.8% accuracy. It achieved F1 scores similar to, yet slightly higher than, those of the first model for the word stems of FW, but a slightly lower F1 score for the Part-Of-Speech (POS) tags of FW.
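The independence assumption behind the second scheme means the per-class log-likelihoods of the two feature types simply add. The sketch below illustrates that combination with two MultinomialNB models over random stand-in count matrices for FS and FW; it shows only the inference rule, not the authors' implementation.

```python
# A minimal sketch of naive-Bayes fusion of two independent feature types:
# log P(y | FS, FW) is proportional to log P(y) + log P(FS|y) + log P(FW|y).
# The count matrices and labels are random stand-ins.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_fs = rng.integers(0, 3, size=(200, 40))  # ME-symbol feature counts (FS)
X_fw = rng.integers(0, 3, size=(200, 60))  # adjacent-word feature counts (FW)
y = rng.integers(0, 2, size=200)           # constraint label per ME

nb_fs = MultinomialNB().fit(X_fs, y)
nb_fw = MultinomialNB().fit(X_fw, y)

# Each posterior already includes the class prior, so subtract it once
# to avoid double-counting when the two posteriors are added.
joint_log = (nb_fs.predict_log_proba(X_fs)
             + nb_fw.predict_log_proba(X_fw)
             - nb_fs.class_log_prior_)
pred = joint_log.argmax(axis=1)
print((pred == y).mean())  # training accuracy of the fused model
```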
Citations: 1