Australasian Document Computing Symposium: latest articles

Putting the public into public health information dissemination: social media and health-related web pages
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407104
R. Steele, Dan Dumbrell
Public health information dissemination represents an interesting combination of broadcasting, sharing, and retrieving relevant health information. Social media-based public health information dissemination offers some particularly interesting characteristics, as individual users or members of the public actually carry out the actions that constitute the dissemination. These actions may also inherently provide novel evaluative information from a document computing perspective, about both the documents and the social media users or health consumers themselves. This paper discusses the novel aspects of social media-based public health information dissemination, including a comparison of its characteristics with search engine-based Web document retrieval. It describes a preliminary analysis of a sample of public health advice tweets, drawn from a larger sample of over 4700 tweets sent by Australian health-related organizations in February 2012. Several preliminary measures computed from these data suggest possible characteristics of public health information dissemination and document evaluation in micro-blog-based systems.
Citations: 15
Exploiting medical hierarchies for concept-based information retrieval
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407100
G. Zuccon, B. Koopman, Anthony N. Nguyen, D. Vickers, Luke Butt
Search technologies are critical to enable clinical staff to rapidly and effectively access patient information contained in free-text medical records. Medical search is challenging as terms in the query are often general but those in relevant documents are very specific, leading to granularity mismatch. In this paper we propose to tackle granularity mismatch by exploiting subsumption relationships defined in formal medical domain knowledge resources. In symbolic reasoning, a subsumption (or 'is-a') relationship is a parent-child relationship where one concept is a subset of another concept. Subsumed concepts are included in the retrieval function. In addition, we investigate a number of initial methods for combining weights of query concepts and those of subsumed concepts. Subsumption relationships were found to provide strong indication of relevant information; their inclusion in retrieval functions yields performance improvements. This result motivates the development of formal models of relationships between medical concepts for retrieval purposes.
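As a rough illustration of the idea in this abstract, the sketch below expands a query concept with the concepts it subsumes and scores documents with a discounted weight for the subsumed ones. The toy hierarchy, the concept names, the `alpha` discount, and the frequency-based scoring are all assumptions made for this example; the abstract does not specify the retrieval function or the weight-combination methods the authors tested.

```python
from collections import Counter

# Hypothetical is-a hierarchy: parent concept -> directly subsumed (child) concepts.
IS_A_CHILDREN = {
    "heart disease": ["myocardial infarction", "cardiomyopathy"],
    "myocardial infarction": ["anterior myocardial infarction"],
}


def subsumed(concept):
    """Return every concept subsumed by `concept` (all of its descendants)."""
    found, stack = [], list(IS_A_CHILDREN.get(concept, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.append(child)
            stack.extend(IS_A_CHILDREN.get(child, []))
    return found


def expand_query(query_concepts, alpha=0.5):
    """Weight query concepts at 1.0 and subsumed concepts at `alpha` (an assumed discount)."""
    weights = {}
    for concept in query_concepts:
        weights[concept] = max(weights.get(concept, 0.0), 1.0)
        for descendant in subsumed(concept):
            weights[descendant] = max(weights.get(descendant, 0.0), alpha)
    return weights


def score(doc_concepts, weights):
    """Sum the weighted frequencies of the query and subsumed concepts found in a document."""
    counts = Counter(doc_concepts)
    return sum(weight * counts[concept] for concept, weight in weights.items())


weights = expand_query(["heart disease"])
specific_record = ["anterior myocardial infarction", "anterior myocardial infarction"]
unrelated_record = ["fractured wrist"]
print(score(specific_record, weights), score(unrelated_record, weights))
```

With the general query "heart disease", a record that mentions only the very specific concept still receives a non-zero score, which is the granularity mismatch the abstract describes.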
Citations: 53
An English-translated parallel corpus for the CJK Wikipedia collections
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407099
Ling-Xiang Tang, S. Geva, A. Trotman
In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named the CJK2E Wikipedia XML corpus. The corpus could support the information retrieval research community and knowledge sharing in Wikipedia in many ways; for example, it could be used for experiments in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to expand the current coverage of the English Wikipedia.
Citations: 3
Efficient indexing algorithms for approximate pattern matching in text
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407087
M. Petri, J. Culpepper
Approximate pattern matching is an important computational problem with a wide variety of applications in Information Retrieval. Efficient solutions to approximate pattern matching can be applied to natural language keyword queries with spelling mistakes, OCR scanned text incorporated into indexes, language model ranking algorithms based on term proximity, or DNA databases containing sequencing errors. In this paper, we present a novel approach to constructing text indexes capable of efficiently supporting approximate search queries. Our approach relies on a new variant of the Context Bound Burrows-Wheeler Transform (k-bwt), referred to as the Variable Depth Burrows-Wheeler Transform (v-bwt). First, we describe our new algorithm, and show that it is reversible. Next, we show how to use the transform to support efficient text indexing and approximate pattern matching. Lastly, we empirically evaluate the use of the v-bwt for DNA and English text collections, and show a significant improvement in approximate search efficiency over more traditional q-gram based approximate pattern matching algorithms.
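For readers unfamiliar with the transforms named above, here is a small sketch of the standard Burrows-Wheeler Transform, its inversion, and a context-bound variant that sorts rotations by only their first `k` symbols. It does not implement the v-bwt, whose definition is not given in the abstract; the function names, the `$` sentinel, and the naive sorting approach are assumptions chosen for brevity rather than efficiency.

```python
def rotation_order(text, k=None):
    """Order rotation start positions by their first k symbols (all symbols if k is None).

    Python's sort is stable, so rotations that share a k-symbol context
    keep their original text order, as in a context-bound transform.
    """
    n = len(text)
    doubled = text + text
    depth = n if k is None else min(k, n)
    return sorted(range(n), key=lambda i: doubled[i:i + depth])


def bwt(text, k=None):
    """Emit the symbol preceding each rotation, in sorted rotation order."""
    return "".join(text[i - 1] for i in rotation_order(text, k))


def inverse_bwt(last, sentinel="$"):
    """Invert the full (unbounded) BWT by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(ch + row for ch, row in zip(last, table))
    return next(row for row in table if row.endswith(sentinel))


text = "banana$"  # '$' is a unique end marker that sorts before the letters
transformed = bwt(text)
print(transformed, inverse_bwt(transformed), bwt(text, k=1))
```

The inversion shown here applies only to the full transform; reversing a context-bound transform needs additional bookkeeping that this sketch omits.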
Citations: 7
An ontology derived from heterogeneous sustainability indicator set documents
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407095
L. Ghahremanloo, J. Thom, L. Magee
We present an ontology to represent the key concepts of sustainability indicators that are increasingly being used to measure the economic, environmental and social properties of complex systems. There have been few efforts to represent multiple indicators formally, even though comparing indicators and measurements across reporting contexts is a critical task. In this paper, we apply the METHONTOLOGY approach to guide the construction of two design candidates we term Generic and Specific. Of the two, the generic design is more abstract, with fewer classes and properties. Documents describing two indicator systems (the Global Reporting Initiative and the Organisation for Economic Co-operation and Development) are used in the design of both candidate ontologies. We then evaluate both ontology designs using the ROMEO approach, to calculate their level of coverage against the seen indicators, as well as against an unseen third indicator set (the United Nations Statistics Division). We also show that use of existing structured approaches like METHONTOLOGY and ROMEO can reduce ambiguity in ontology design and evaluation for domain-level ontologies. It is concluded that where an ontology needs to be designed for both seen and unseen indicator systems, a generic and reusable design is preferable.
Citations: 9
An attempt to measure the quality of questions in question time of the Australian Federal Parliament
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407098
A. Turpin
This paper uses standard information retrieval techniques to measure the quality of information exchange during Question Time in the Australian Federal Parliament's House of Representatives from 1998 to 2012. A search engine indexes all answers to questions and then runs each question as a query, recording the rank of the actual answer in the returned list of documents. Using this rank as a measure of quality, Question Time has deteriorated over the last decade. The main deterioration has been in information exchange in "Dorothy Dixer" questions. The corpus used for this study is available from the author's web page for further investigations.
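The measurement procedure described in this abstract can be sketched as follows: index the answers, run each question as a query, and record where the true answer lands in the ranking. The three question and answer pairs, the tokenizer, and the simple TF-IDF-style scoring are invented for illustration; the abstract does not say which search engine or ranking function was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(answers):
    """Return per-answer term counts and document frequencies."""
    docs = [Counter(tokenize(answer)) for answer in answers]
    df = Counter(term for doc in docs for term in doc)
    return docs, df


def rank_of_answer(question, answer_id, docs, df):
    """Rank all answers against the question; return the 1-based rank of the true answer."""
    n = len(docs)
    query_terms = set(tokenize(question))
    scored = []
    for doc_id, doc in enumerate(docs):
        score = sum(doc[t] * math.log(1 + n / df[t]) for t in query_terms if t in doc)
        scored.append((-score, doc_id))  # ties broken by document id
    scored.sort()
    ranking = [doc_id for _, doc_id in scored]
    return ranking.index(answer_id) + 1


# Hypothetical question/answer pairs, not taken from Hansard.
pairs = [
    ("What tax cuts does the budget deliver for families?",
     "The budget delivers tax cuts for families."),
    ("Will the minister outline the plan for jobs?",
     "I thank the member and note the opposition voted against it."),
    ("How much will hospital funding rise next year?",
     "Hospital funding will rise by ten percent next year."),
]
docs, df = build_index([answer for _, answer in pairs])
ranks = [rank_of_answer(q, i, docs, df) for i, (q, _) in enumerate(pairs)]
mean_reciprocal_rank = sum(1 / r for r in ranks) / len(ranks)
print(ranks, round(mean_reciprocal_rank, 3))
```

An answer that shares little vocabulary with its question falls down the ranking, which is the signal the study uses as a proxy for the quality of the exchange.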
Citations: 1
Models and metrics: IR evaluation as a user process
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407092
Alistair Moffat, Falk Scholer, Paul Thomas
Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. The former has the benefit of directly assessing the actual goal of the system, namely the user's ability to complete a search task; whereas the latter approach has the benefit of being quantitative and repeatable. Each given effectiveness metric is an attempt to bridge the gap between these two evaluation approaches, since the implicit belief supporting the use of any particular metric is that user task performance should be correlated with the numeric score provided by the metric. In this work we explore that linkage, considering a range of effectiveness metrics, and the user search behavior that each of them implies. We then examine more complex user models, as a guide to the development of new effectiveness metrics. We conclude by summarizing an experiment that we believe will help establish the strength of the linkage between models and metrics.
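To make the link between metrics and implied user behavior concrete, the sketch below computes three widely used effectiveness metrics over a ranked list of binary relevance judgments. The choice of metrics and the user-model readings in the comments are this example's assumptions; the abstract does not list which metrics the paper considers.

```python
def precision_at_k(rels, k):
    """Implied user: inspects exactly the top k results, then stops."""
    return sum(rels[:k]) / k


def reciprocal_rank(rels):
    """Implied user: scans down the list and stops at the first relevant result."""
    for position, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / position
    return 0.0


def rank_biased_precision(rels, p=0.8):
    """Implied user: after each result, continues to the next with probability p."""
    return (1 - p) * sum(rel * p ** depth for depth, rel in enumerate(rels))


# Hypothetical judgments for one run: 1 = relevant, 0 = not relevant.
run = [1, 0, 1, 0, 0]
print(precision_at_k(run, 5), reciprocal_rank(run), rank_biased_precision(run, p=0.5))
```

Each function scores the same run differently because each assumes a different pattern of user attention over the ranking.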
Citations: 37
Effects of spam removal on search engine efficiency and effectiveness
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407086
Matt Crane, A. Trotman
Spam has long been identified as a problem that web search engines are required to deal with. Large collection sizes are also an increasing issue for institutions that do not have the necessary resources to process them in their entirety. In this paper we investigate the effect that withholding documents identified as spam has on the resources required to process large collections. We also investigate the resulting search effectiveness and efficiency when different amounts of spam are withheld. We find that by removing spam at indexing time we are able to decrease the index size without affecting the indexing throughput, and are able to improve search precision for some thresholds.
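A minimal sketch of withholding spam at indexing time: documents whose spam score reaches a threshold are skipped, and the resulting index size is compared across thresholds. The documents, the `spam_scores` values, and the convention that a higher score means more likely spam are assumptions for this example; the abstract does not describe the spam classifier or its scoring scale.

```python
from collections import defaultdict


def build_index(docs, spam_scores, threshold):
    """Build an inverted index, withholding documents scored at or above `threshold`."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        if spam_scores[doc_id] >= threshold:
            continue  # treated as spam and left out of the index
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index


def postings_count(index):
    """Total number of postings, used here as a simple proxy for index size."""
    return sum(len(postings) for postings in index.values())


# Hypothetical collection and spam scores in [0, 1].
docs = {
    1: "cheap pills cheap pills",
    2: "information retrieval evaluation",
    3: "retrieval of cheap flights",
}
spam_scores = {1: 0.95, 2: 0.05, 3: 0.40}

for threshold in (1.01, 0.9, 0.3):
    print(threshold, postings_count(build_index(docs, spam_scores, threshold)))
```

Lowering the threshold shrinks the index but also risks withholding useful documents, so effectiveness would need to be measured at each setting.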
Citations: 8
Comparing scanning behaviour in web search on small and large screens
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407089
Jaewon Kim, Paul Thomas, R. Sankaranarayana, Tom Gedeon
Although web search on mobile devices is common, little is known about how users read search result lists on a small screen. We used eye tracking to compare users' scanning behaviour of web search engine result pages on a small screen (hand-held devices) and a large screen (desktops or laptops). The objective was to determine whether search result pages should be designed differently for mobile devices. To compare scanning behaviour, we considered only the fixation time and scanning strategy using our new method called 'Trackback'. The results showed that on a small screen, users spend relatively more time conducting a search than they do on a large screen, despite tending to look less far ahead beyond the link that they eventually select. They also show a stronger tendency to seek information within the top three results on a small screen than on a large screen. The reason for this tendency may be difficulties in reading and the relative location of page folds. The results clearly indicated that scanning behaviour during web search on a small screen is different from that on a large screen. Thus, research efforts should be invested in improving the presentation of search engine result pages on small screens, taking scanning behaviour into account. This will help provide a better search experience in terms of search time, accuracy of finding correct links, and user satisfaction.
Citations: 21
Finding additional semantic entity information for search engines
Pub Date : 2012-12-05 DOI: 10.1145/2407085.2407101
Jun Hou, R. Nayak, Jinglan Zhang
Entity-oriented search has become an essential component of modern search engines. It focuses on retrieving a list of entities, or information about specific entities, instead of documents. In this paper, we study the problem of finding entity-related information, referred to as attribute-value pairs, that plays a significant role in searching target entities. We propose a novel decomposition framework combining reduced relations and the discriminative model, Conditional Random Field (CRF), for automatically finding entity-related attribute-value pairs from free text documents. This decomposition framework allows us to locate potential text fragments and identify the hidden semantics, in the form of attribute-value pairs for user queries. Empirical analysis shows that the decomposition framework outperforms pattern-based approaches due to its capability of effective integration of syntactic and semantic features.
Citations: 1