
Australasian Document Computing Symposium: latest publications

Classifying microblogs for disasters
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537737
Sarvnaz Karimi, Jie Yin, Cécile Paris
Monitoring social media in critical disaster situations can potentially assist emergency and media personnel to deal with events as they unfold, and focus their resources where they are most needed. We address the issue of filtering massive amounts of Twitter data to identify high-value messages related to disasters, and to further classify disaster-related messages into those pertaining to particular disaster types, such as earthquake, flooding, fire, or storm. Unlike post-hoc analysis that most previous studies have done, we focus on building a classification model on past incidents to detect tweets about current incidents. Our experimental results demonstrate the feasibility of using classification methods to identify disaster-related tweets. We analyse the effect of different features in classifying tweets and show that using generic features rather than incident-specific ones leads to better generalisation on the effectiveness of classifying unseen incidents.
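The paper's point about generic versus incident-specific features can be made concrete with a small sketch (hypothetical code, not the authors' implementation): incident-specific tokens such as place names or event hashtags are replaced with a placeholder, so a classifier trained on past incidents sees the same feature space when applied to unseen ones.

```python
import re

def generic_features(tweet, incident_terms):
    """Tokenise a tweet and replace incident-specific terms (place names,
    event hashtags) with a generic placeholder. Illustrative only: the
    incident_terms set would in practice come from incident metadata."""
    tokens = re.findall(r"[#@]?\w+", tweet.lower())
    return ["<INCIDENT>" if t.lstrip("#@") in incident_terms else t
            for t in tokens]
```

A classifier trained on these placeholder-normalised tokens no longer keys on, say, a particular city name, which is the kind of generalisation the abstract reports.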
Citations: 44
ADCS reaches adulthood: an analysis of the conference and its community over the last eighteen years
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537741
B. Koopman, G. Zuccon, Lance De Vine, Aneesha Bakharia, P. Bruza, Laurianne Sitbon, Andrew Gibson
How influential is the Australian Document Computing Symposium (ADCS)? What do ADCS articles speak about and who cites them? Who is the ADCS community and how has it evolved? This paper considers eighteen years of ADCS, investigating both the conference and its community. A content analysis of the proceedings uncovers the diversity of topics covered in ADCS and how these have changed over the years. Citation analysis reveals the impact of the papers. The number of authors and where they originate from reveal who has contributed to the conference. Finally, we generate co-author networks which reveal the collaborations within the community. These networks show how clusters of researchers form, the effect geographic location has on collaboration, and how these have evolved over time.
Citations: 0
Quality biased thread retrieval using the voting model
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537752
Ameer Tawfik Albaham, N. Salim
Thread retrieval is an essential tool in knowledge-based forums. However, forum content quality varies from excellent to mediocre and spam; thus, search methods should find not only relevant threads but also those with high quality content. Some studies have shown that leveraging quality indicators improves thread search. However, these studies ignored the hierarchical and the conversational structures of threads in estimating topical relevance and content quality. In that regard, this paper introduces leveraging message quality indicators in ranking threads. To achieve this, we first use the Voting Model to convert message level quality features into thread level features. We then train a learning to rank method to combine these thread level features. Preliminary results with some features reveal that representing threads as collections of messages is superior to treating them as concatenations of their messages. The results show also the utility of leveraging message content quality as compared to non quality-based methods.
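The aggregation step described here can be sketched roughly as follows (a simplification with illustrative names, not the paper's exact formulation): each retrieved message "votes" for its parent thread, and message-level scores are folded into a thread-level score by a voting aggregator.

```python
def thread_scores(message_hits, method="combsum"):
    """Aggregate message-level retrieval scores into thread-level scores
    in the spirit of the Voting Model. message_hits: (thread_id, score)
    pairs for the messages retrieved by a query."""
    per_thread = {}
    for thread_id, score in message_hits:
        per_thread.setdefault(thread_id, []).append(score)
    agg = {
        "votes":   len,  # number of messages voting for the thread
        "combsum": sum,  # total evidence across messages
        "combmax": max,  # strength of the single best message
    }[method]
    return sorted(((t, agg(v)) for t, v in per_thread.items()),
                  key=lambda x: x[1], reverse=True)
```

Different aggregators encode different assumptions: `combsum` rewards threads with many good messages, while `combmax` lets one excellent message carry the thread.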
Citations: 7
Merging algorithms for enterprise search
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537750
Pengfei Li, Paul Thomas, D. Hawking
Effective enterprise search must draw on a number of sources---for example web pages, telephone directories, and databases. Doing this means we need a way to make a single sorted list from results of very different types. Many merging algorithms have been proposed but none have been applied to this realistic application. We report the results of an experiment which simulates heterogeneous enterprise retrieval, in a university setting, and uses multi-grade expert judgements to compare merging algorithms. Merging algorithms considered include several variants of round-robin, several methods proposed by Rasolofo et al. in the Current News Metasearcher, and four novel variations including a learned multi-weight method. We find that the round-robin methods and one of the Rasolofo methods perform significantly worse than others. The GDS_TS method of Rasolofo achieves the highest average NDCG@10 score but the differences between it and the other GDS_methods, local reranking, and the multi-weight method were not significant.
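Round-robin is the simplest of the merging families compared here; a minimal sketch (the score-based GDS variants, which need comparable server statistics, are omitted):

```python
def round_robin_merge(result_lists):
    """Merge ranked lists from heterogeneous sources by taking the next
    unseen result from each source in turn. A baseline merging strategy;
    score-based methods replace the turn-taking with score comparison."""
    merged, seen = [], set()
    iters = [iter(lst) for lst in result_lists]
    while iters:
        survivors = []
        for it in iters:
            for doc in it:           # advance to this source's next new result
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
                    survivors.append(it)
                    break
            # an iterator that exhausts without yielding is dropped
        iters = survivors
    return merged
```

Round-robin ignores scores entirely, which is one plausible reason it underperforms the score-aware methods in the experiments reported above.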
Citations: 9
Efficient top-k retrieval with signatures
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537742
Timothy Chappell, S. Geva, Anthony N. Nguyen, G. Zuccon
This paper describes a new method of indexing and searching large binary signature collections to efficiently find similar signatures, addressing the scalability problem in signature search. Signatures offer efficient computation with acceptable measure of similarity in numerous applications. However, performing a complete search with a given search argument (a signature) requires a Hamming distance calculation against every signature in the collection. This quickly becomes excessive when dealing with large collections, presenting issues of scalability that limit their applicability. Our method efficiently finds similar signatures in very large collections, trading memory use and precision for greatly improved search speed. Experimental results demonstrate that our approach is capable of finding a set of nearest signatures to a given search argument with a high degree of speed and fidelity.
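The exhaustive baseline that motivates the paper is easy to state (a sketch, with signatures represented as Python ints standing in for bit vectors):

```python
def hamming(a, b):
    """Hamming distance between two binary signatures stored as ints:
    the number of bit positions in which they differ."""
    return bin(a ^ b).count("1")

def top_k_exhaustive(query_sig, signatures, k):
    """The O(N) scan the paper's index aims to avoid: compare the query
    signature against every signature in the collection."""
    return sorted(signatures, key=lambda s: hamming(query_sig, s))[:k]
```

It is this per-query pass over every signature that becomes excessive at scale, and that the proposed index trades memory and a little precision to avoid.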
Citations: 10
Integrated instance- and class-based generative modeling for text classification
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537751
Antti Puurula, Sung-Hyon Myaeng
Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.
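One way to picture the class-smoothed document kernel is the sketch below (a deliberate simplification with hypothetical parameter names; TDM proper uses hierarchical smoothing over a full mixture of document models, not a single interpolation):

```python
import math
from collections import Counter

def kernel_score(query_tokens, doc_tokens, class_counts, class_total, alpha=0.5):
    """Log-likelihood of a query under one stored training document,
    where the document multinomial is interpolated (smoothed) with its
    class model -- the instance-level kernel idea, greatly simplified."""
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    score = 0.0
    for w in query_tokens:
        p_class = (class_counts.get(w, 0) + 1) / (class_total + 1)  # crude add-one class model
        p_doc = doc_counts[w] / doc_total if doc_total else 0.0
        score += math.log(alpha * p_doc + (1 - alpha) * p_class)
    return score
```

The class model keeps every probability non-zero (avoiding the sparsity that hurts pure instance-based learning), while the document term keeps the fine-grained per-instance information a plain class-based model discards.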
Citations: 10
Choices in batch information retrieval evaluation
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537745
Falk Scholer, Alistair Moffat, Paul Thomas
Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores. Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality. We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.
Citations: 3
Economic models of search
Pub Date : 2013-12-05 DOI: 10.1145/2537734.2537735
L. Azzopardi
Searching is inherently an interactive process usually requiring a number of queries to be submitted and a number of documents to be assessed in order to find the desired amount of relevant information. While numerous models of search have been proposed, they have been largely conceptual in nature, providing a descriptive account of the search process. For example, Bates' Berry Picking metaphor aptly describes how information seekers forage for relevant information [4]. However, it lacks any predictive or explanatory power. In this talk, I will outline how microeconomic theory can be applied to interactive information retrieval, where the search process can be viewed as a combination of inputs (i.e. queries and assessments) which are used to "produce" output (i.e. relevance). Under this view, it is possible to build models that not only describe the relationship between interaction, cost and gain, but also explain and predict behaviour. During the talk, I will run through a number of examples of how economics can explain different behaviours. For example, why PhD students should search more than their supervisors (using an economic model developed by Cooper [6]), why queries are short [1], why Boolean searchers need to explore more results, and why it is okay to look at the first few results when searching the web [2]. I shall then describe how the cost of different interactions affects search behaviour [3], before extending the current theory to include other variables (such as the time spent on the search result page, the interaction with snippets, etc.) to create more sophisticated and realistic models. Essentially, I will argue that by using such models we can: 1. theorise and predict how users will behave when interacting with systems; 2. ascertain how the costs of different interactions will influence search behaviour; 3. understand why particular interaction styles, strategies and techniques are or are not adopted by users; and 4. determine what interactions and functionalities are worthwhile based on their expected gain and associated costs.
Citations: 7
Graph-based concept weighting for medical information retrieval
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407096
B. Koopman, G. Zuccon, P. Bruza, Laurianne Sitbon, Michael Lawley
This paper presents a graph-based method to weight medical concepts in documents for the purposes of information retrieval. Medical concepts are extracted from free-text documents using a state-of-the-art technique that maps n-grams to concepts from the SNOMED CT medical ontology. In our graph-based concept representation, concepts are vertices in a graph built from a document, edges represent associations between concepts. This representation naturally captures dependencies between concepts, an important requirement for interpreting medical text, and a feature lacking in bag-of-words representations. We apply existing graph-based term weighting methods to weight medical concepts. Using concepts rather than terms addresses vocabulary mismatch as well as encapsulates terms belonging to a single medical entity into a single concept. In addition, we further extend previous graph-based approaches by injecting domain knowledge that estimates the importance of a concept within the global medical domain. Retrieval experiments on the TREC Medical Records collection show our method outperforms both term and concept baselines. More generally, this work provides a means of integrating background knowledge contained in medical ontologies into data-driven information retrieval approaches.
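In outline, the representation looks like this (an illustrative degree-based weighting standing in for the graph-based term-weighting schemes the paper applies; the SNOMED CT concept mapping is assumed to have happened already):

```python
from collections import defaultdict

def concept_weights(doc_concepts, window=2):
    """Build a concept graph from a document -- concepts as vertices,
    co-occurrence within a sliding window as edges -- and weight each
    concept by its degree. A simple stand-in for richer graph-based
    weighting (e.g. PageRank-style scores over the same graph)."""
    edges = defaultdict(set)
    for i, c in enumerate(doc_concepts):
        for j in range(i + 1, min(i + window + 1, len(doc_concepts))):
            d = doc_concepts[j]
            if d != c:               # no self-loops
                edges[c].add(d)
                edges[d].add(c)
    return {c: len(neighbours) for c, neighbours in edges.items()}
```

Because edges record which concepts appear together, the weight of a concept reflects its associations within the document, which is exactly the dependency information a bag-of-words representation throws away.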
Citations: 26
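The representation described in the abstract above can be illustrated with a short sketch: build a co-occurrence graph over a document's extracted concepts, then weight each concept vertex with a simple centrality. The concept names below are invented stand-ins for SNOMED CT concepts, and degree centrality stands in for the paper's specific graph-based weighting functions, which are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def build_concept_graph(sentences):
    """Undirected co-occurrence graph: vertices are concepts, and an
    edge links any two concepts appearing in the same sentence."""
    graph = defaultdict(set)
    for concepts in sentences:
        for a, b in combinations(set(concepts), 2):
            graph[a].add(b)
            graph[b].add(a)
        for c in concepts:           # keep isolated concepts as vertices
            graph.setdefault(c, set())
    return graph

def degree_weights(graph):
    """Weight each concept by normalised degree centrality."""
    n = max(len(graph) - 1, 1)
    return {c: len(nbrs) / n for c, nbrs in graph.items()}

# Toy document: each inner list is one sentence's extracted concepts.
doc = [
    ["myocardial_infarction", "chest_pain", "troponin"],
    ["chest_pain", "aspirin"],
    ["myocardial_infarction", "troponin"],
]
g = build_concept_graph(doc)
w = degree_weights(g)
```

In this toy document, chest_pain co-occurs with every other concept and so receives the highest weight; the paper's extension would further adjust such weights by a concept's estimated importance in the global medical domain.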
Reordering an index to speed query processing without loss of effectiveness
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407088
D. Hawking, Timothy Jones
Following Long and Suel, we empirically investigate the importance of document order in search engines which rank documents using a combination of dynamic (query-dependent) and static (query-independent) scores, and use document-at-a-time (DAAT) processing. When inverted file postings are in collection order, assigning document numbers in order of descending static score supports lossless early termination while maintaining good compression.

Since static scores may not be available until all documents have been gathered and indexed, we build a tool for reordering an existing index and show that it operates in less than 20% of the original indexing time. We note that this additional cost is easily recouped by savings at query processing time. We compare best early-termination points for several different index orders on three enterprise search collections (a whole-of-government index with two very different query sets, and a collection from a UK university). We also present results for the same orders for ClueWeb09-CatB. Our evaluation focuses on finding results likely to be clicked on by users of Web or website search engines --- Nav and Key results in the TREC 2011 Web Track judging scheme.

The orderings tested are Original, Reverse, Random, and QIE (descending order of static score). For three enterprise search test sets we find that QIE order can achieve close-to-maximal search effectiveness with much lower computational cost than for other orderings. Additionally, reordering has negligible impact on compressed index size for indexes that contain position information. Our results for an artificial query set against the TREC ClueWeb09 Category B collection are much more equivocal and we canvass possible explanations for future investigation.
Citations: 9
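The QIE renumbering idea from the abstract above can be sketched in a few lines: reassign document IDs in descending static-score order, so that ascending docID within each postings list is also descending static score, and the best documents are reached first. This is a minimal sketch with invented toy postings and scores; compression, the dynamic-score component, and the paper's full DAAT machinery are omitted.

```python
def reorder_index(postings, static_scores):
    """Renumber documents in descending static-score order (QIE order),
    then rewrite every postings list in the new docID numbering."""
    ranked = sorted(static_scores, key=static_scores.get, reverse=True)
    remap = {old: new for new, old in enumerate(ranked)}
    return {term: sorted(remap[d] for d in docs)
            for term, docs in postings.items()}

def top_k_daat(postings, terms, k):
    """Toy conjunctive pass over a QIE-ordered index: the first k
    matching docIDs are the k highest static-score matches, so the
    traversal can stop early without loss when ranking by static
    score alone (the paper also adds a dynamic, query-dependent score)."""
    lists = [postings[t] for t in terms if t in postings]
    if not lists:
        return []
    matches = sorted(set(lists[0]).intersection(*map(set, lists[1:])))
    return matches[:k]

# Invented toy index: 4 documents, 2 terms.
postings = {"index": [0, 2, 3], "reorder": [1, 2]}
static_scores = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.7}
new = reorder_index(postings, static_scores)
# Documents renumbered 1->0, 3->1, 2->2, 0->3 (descending score).
```

With the renumbering applied, stopping a conjunctive traversal after the first k matches is lossless for a static-score-only ranking; the paper studies where such early-termination points can safely sit once dynamic scores enter the mix.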