
Proceedings of the 21st ACM international conference on Information and knowledge management: Latest Publications

Authentication of moving range queries
Duncan Yung, Eric Lo, Man Lung Yiu
A moving range query continuously reports the query results (e.g., restaurants) that are within radius $r$ of a moving query point (e.g., a moving tourist). To minimize the communication cost with mobile clients, a service provider that evaluates moving range queries also returns a safe region that bounds the validity of the query results. However, an untrustworthy service provider may report incorrect safe regions to mobile clients. In this paper, we present efficient techniques for authenticating the safe regions of moving range queries. We theoretically prove that our methods minimize the data sent between the service provider and the mobile clients. Extensive experiments on both real and synthetic datasets show that our methods incur small communication costs and overhead.
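To make the safe-region idea concrete, below is a minimal client-side sketch. It assumes a circular safe region and only decides when the client must re-contact the server; the paper's actual contribution, authenticating the region itself, is not reproduced here, and all names are illustrative.

```python
import math

def still_valid(query_point, safe_region):
    """Return True while the cached query result is guaranteed valid.

    safe_region is assumed circular: (center_x, center_y, radius).
    The client only re-contacts the server once it leaves the region.
    """
    cx, cy, radius = safe_region
    qx, qy = query_point
    return math.hypot(qx - cx, qy - cy) <= radius

# Usage: the tourist moves; we only refresh when the region is exited.
region = (0.0, 0.0, 0.5)           # hypothetical safe region from the server
for position in [(0.1, 0.2), (0.3, 0.3), (0.6, 0.4)]:
    if not still_valid(position, region):
        print(f"left safe region at {position}: re-query the server")
```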
{"title":"Authentication of moving range queries","authors":"Duncan Yung, Eric Lo, Man Lung Yiu","doi":"10.1145/2396761.2398441","DOIUrl":"https://doi.org/10.1145/2396761.2398441","url":null,"abstract":"A moving range query continuously reports the query result (e.g., restaurants) that are within radius $r$ from a moving query point (e.g., moving tourist). To minimize the communication cost with the mobile clients, a service provider that evaluates moving range queries also returns a safe region that bounds the validity of query results. However, an untrustworthy service provider may report incorrect safe regions to mobile clients. In this paper, we present efficient techniques for authenticating the safe regions of moving range queries. We theoretically proved that our methods for authenticating moving range queries can minimize the data sent between the service provider and the mobile clients. Extensive experiments are carried out using both real and synthetic datasets and results show that our methods incur small communication costs and overhead.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115324223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Fast candidate generation for two-phase document ranking: postings list intersection with bloom filters
N. Asadi, Jimmy J. Lin
Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a "cheap" but low-quality scoring function and is then reranked by an "expensive" but high-quality method (usually machine-learned). This paper focuses on the problem of candidate generation for conjunctive query processing in this context. We describe and evaluate fast, approximate postings list intersection algorithms based on Bloom filters. Thanks to the power of modern learning-to-rank techniques and the emphasis on early precision, significant speedups can be achieved without loss of end-to-end retrieval effectiveness. Our explorations reveal a rich design space in which effectiveness and efficiency can be balanced in response to specific hardware configurations and application scenarios.
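The flavor of Bloom-filter-based candidate generation can be sketched as follows. This is an illustration of the general technique, not the authors' exact algorithm; the filter parameters, hashing scheme, and helper names are assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over document IDs (illustrative parameters)."""

    def __init__(self, size=1 << 16, num_hashes=4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def candidates(postings, blooms):
    """Approximate conjunctive matching: walk the shortest postings list
    and probe the Bloom filters of the remaining terms. False positives
    are tolerated because the expensive second-phase ranker prunes them."""
    terms = sorted(postings, key=lambda t: len(postings[t]))
    shortest, rest = terms[0], terms[1:]
    return [doc for doc in postings[shortest]
            if all(doc in blooms[t] for t in rest)]

postings = {"ranking": [1, 4, 7, 9], "bloom": [4, 9, 12]}
blooms = {t: BloomFilter() for t in postings}
for t, docs in postings.items():
    for d in docs:
        blooms[t].add(d)
print(candidates(postings, blooms))   # [4, 9], barring false positives
```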
{"title":"Fast candidate generation for two-phase document ranking: postings list intersection with bloom filters","authors":"N. Asadi, Jimmy J. Lin","doi":"10.1145/2396761.2398656","DOIUrl":"https://doi.org/10.1145/2396761.2398656","url":null,"abstract":"Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a \"cheap\" but low-quality scoring function, which is then reranked by an \"expensive\" but high-quality method (usually machine-learned). This paper focuses on the problem of candidate generation for conjunctive query processing in this context. We describe and evaluate a fast, approximate postings list intersection algorithms based on Bloom filters. Due to the power of modern learning-to-rank techniques and emphasis on early precision, significant speedups can be achieved without loss of end-to-end retrieval effectiveness. Explorations reveal a rich design space where effectiveness and efficiency can be balanced in response to specific hardware configurations and application scenarios.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115435390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
I want what I need!: analyzing subjectivity of online forum threads
P. Biyani, Cornelia Caragea, Amit Singh, P. Mitra
Online forums have become a popular source of information due to the unique nature of the information they contain. Internet users turn to these forums both to get other people's opinions on issues and to find factual answers to specific questions. Topics discussed in online forum threads can therefore be subjective, seeking personal opinions, or non-subjective, seeking factual information. Knowing the subjectivity orientation of threads would help forum search engines satisfy users' information needs more effectively by matching the subjectivity of a user's query with that of the topics discussed in the threads, in addition to the lexical match between the two. We study methods for analyzing the subjectivity of online forum threads. Experimental results on a popular online forum demonstrate the effectiveness of our methods.
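As a point of reference, a generic supervised baseline for thread subjectivity might look like the sketch below (TF-IDF plus logistic regression via scikit-learn). The abstract does not specify the authors' features or classifier, so everything here is an assumption.

```python
# A generic baseline, not the authors' method: bag-of-words TF-IDF over
# thread text with logistic regression. Label 1 = subjective (opinion-
# seeking), 0 = non-subjective (fact-seeking). Toy data for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

threads = [
    "What do you think of this camera?",       # opinion-seeking
    "How do I reset the router to factory?",   # fact-seeking
]
labels = [1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(threads, labels)
print(model.predict(["Which phone is better for travel photos?"]))
```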
{"title":"I want what i need!: analyzing subjectivity of online forum threads","authors":"P. Biyani, Cornelia Caragea, Amit Singh, P. Mitra","doi":"10.1145/2396761.2398675","DOIUrl":"https://doi.org/10.1145/2396761.2398675","url":null,"abstract":"Online forums have become a popular source of information due to the unique nature of information they contain. Internet users use these forums to get opinions of other people on issues and to find factual answers to specific questions. Topics discussed in online forum threads can be subjective seeking personal opinions or non-subjective seeking factual information. Hence, knowing subjectivity orientation of threads would help forum search engines to satisfy user's information needs more effectively by matching the subjectivities of user's query and topics discussed in the threads in addition to lexical match between the two. We study methods to analyze the subjectivity of online forum threads. Experimental results on a popular online forum demonstrate the effectiveness of our methods.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116791452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
A model-based approach for RFID data stream cleansing
Zhou Zhao, Wilfred Ng
In recent years, RFID technologies have been used in many applications, such as inventory checking and object tracking. However, raw RFID data are inherently unreliable due to physical device limitations and various kinds of environmental noise. Existing work mainly focuses on RFID data cleansing in a static environment (e.g., inventory checking). It is therefore difficult to cleanse RFID data streams in a mobile environment (e.g., object tracking) using existing solutions, which do not address the missing-data issue effectively. In this paper, we study how to cleanse RFID data streams for object tracking, which is a challenging problem since a significant percentage of readings are routinely dropped. We propose a probabilistic model for object tracking in a mobile environment and develop a Bayesian inference based approach for cleansing RFID data using the model. In order to sample data from the movement distribution, we devise a sequential sampler that cleans RFID data with high accuracy and efficiency. We validate the effectiveness and robustness of our solution through extensive simulations and demonstrate its performance on two real RFID applications: human tracking and conveyor-belt monitoring.
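The sequential-sampling idea can be illustrated with a toy particle filter for a 1-D tag position. This is a generic Bayesian-filtering sketch in the spirit of the described sampler, not the paper's model; the motion and sensor models and all names are assumptions.

```python
import math
import random

def particle_filter_step(particles, reading, motion_noise=0.5, sensor_noise=1.0):
    """One sequential-sampling step for 1-D tag tracking (illustrative).

    particles: hypothesized tag positions. reading: position of the antenna
    that detected the tag, or None when the reading was dropped."""
    # 1. Propagate each particle through a random-walk motion model.
    moved = [p + random.gauss(0.0, motion_noise) for p in particles]
    if reading is None:
        return moved                      # missed read: keep the prediction
    # 2. Weight particles by the Gaussian likelihood of the reading.
    weights = [math.exp(-(p - reading) ** 2 / (2 * sensor_noise ** 2)) + 1e-12
               for p in moved]
    # 3. Resample in proportion to the weights.
    return random.choices(moved, weights=weights, k=len(moved))

# Track through a stream with one dropped reading (None).
particles = [0.0] * 200
for reading in [0.9, None, 2.1]:
    particles = particle_filter_step(particles, reading)
print(sum(particles) / len(particles))    # posterior mean position estimate
```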
{"title":"A model-based approach for RFID data stream cleansing","authors":"Zhou Zhao, Wilfred Ng","doi":"10.1145/2396761.2396871","DOIUrl":"https://doi.org/10.1145/2396761.2396871","url":null,"abstract":"In recent years, RFID technologies have been used in many applications, such as inventory checking and object tracking. However, raw RFID data are inherently unreliable due to physical device limitations and different kinds of environmental noise. Currently, existing work mainly focuses on RFID data cleansing in a static environment (e.g. inventory checking). It is therefore difficult to cleanse RFID data streams in a mobile environment (e.g. object tracking) using the existing solutions, which do not address the data missing issue effectively. In this paper, we study how to cleanse RFID data streams for object tracking, which is a challenging problem, since a significant percentage of readings are routinely dropped. We propose a probabilistic model for object tracking in a mobile environment. We develop a Bayesian inference based approach for cleansing RFID data using the model. In order to sample data from the movement distribution, we devise a sequential sampler that cleans RFID data with high accuracy and efficiency. We validate the effectiveness and robustness of our solution through extensive simulations and demonstrate its performance by using two real RFID applications of human tracking and conveyor belt monitoring.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116916864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Finding nuggets in IP portfolios: core patent mining through textual temporal analysis
Po Hu, Minlie Huang, Peng Xu, Weichang Li, A. Usadi, Xiaoyan Zhu
Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.
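One simple way to make topic-based novelty and influence concrete is sketched below, assuming each patent is represented as a topic distribution (e.g., from a topic model). The measures shown are illustrative stand-ins, not the paper's definitions.

```python
import math

def kl(p, q, eps=1e-9):
    """KL divergence between two topic distributions (lists of floats)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def novelty(patent, earlier):
    """Novelty: distance to the closest earlier patent in topic space."""
    return min(kl(patent, prior) for prior in earlier)

def influence(patent, later, threshold=0.2):
    """Influence: how many later patents stay close to this one."""
    return sum(1 for nxt in later if kl(nxt, patent) < threshold)

doc = [0.7, 0.2, 0.1]                     # hypothetical topic mixture
print(novelty(doc, [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]))
print(influence(doc, [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]))
```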
{"title":"Finding nuggets in IP portfolios: core patent mining through textual temporal analysis","authors":"Po Hu, Minlie Huang, Peng Xu, Weichang Li, A. Usadi, Xiaoyan Zhu","doi":"10.1145/2396761.2398524","DOIUrl":"https://doi.org/10.1145/2396761.2398524","url":null,"abstract":"Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121149754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Generating facets for phone-based navigation of structured data
Krishna Kummamuru, Ajith Jujjuru, Mayuri Duggirala
Designing interactive voice systems that place an optimal cognitive load on callers has been an active research topic for quite some time. Many studies have compared user preferences for navigation trees of greater depth versus greater breadth. In this paper, we consider phone-based navigation of structured data containing various types of attributes. This problem is particularly relevant to emerging economies, where innovative voice-based applications are being built to serve semi-literate populations. We address the problem of identifying the right sequence of facets to present to the user for phone-based navigation of the data in two stages. First, we perform extensive user studies in the target population to understand the relation between the nature of the facets (attributes) of the data and the cognitive load. Second, we propose an algorithm for designing optimal navigation trees based on the inferences made in the first phase. We compare the proposed algorithm with traditional facet generation algorithms with respect to various factors and discuss its optimality.
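A plain entropy-driven facet selector conveys the flavor of the second stage. This is a sketch only: the paper's algorithm also folds in the cognitive-load findings from the first-stage user studies, which a pure information criterion does not capture.

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(values).values())

def next_facet(records, facets, max_options=5):
    """Pick the facet whose values split the remaining records most evenly,
    capped so the spoken menu stays short for a phone caller (a sketch)."""
    usable = [f for f in facets if len({r[f] for r in records}) <= max_options]
    return max(usable, key=lambda f: entropy([r[f] for r in records]))

records = [{"cuisine": "thai", "price": "low"},
           {"cuisine": "thai", "price": "low"},
           {"cuisine": "greek", "price": "low"}]
print(next_facet(records, ["cuisine", "price"]))   # "cuisine": price is uninformative
```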
{"title":"Generating facets for phone-based navigation of structured data","authors":"Krishna Kummamuru, Ajith Jujjuru, Mayuri Duggirala","doi":"10.1145/2396761.2398431","DOIUrl":"https://doi.org/10.1145/2396761.2398431","url":null,"abstract":"Designing interactive voice systems that have optimum cognitive load on callers has been an active research topic for quite some time. There have been many studies comparing the user preferences on navigation trees with higher depths over higher breadths. In this paper, we consider the navigation of structured data containing various types of attributes using phone-based interactions. This problem is particularly relevant to emerging economies in which innovative voice-based applications are being built to address semi-literate population. We address the problem of identifying the right sequence of facets to be presented to the user for phone-based navigation of the data in two stages. Firstly, we perform extensive user studies in the target population to understand the relation between the nature of facets (attributes) of the data and the cognitive load. Secondly, we propose an algorithm to design optimum navigation trees based on the inferences made in the first phase. We compare the proposed algorithm with the traditional facet generation algorithms with respect to various factors and discuss the optimality of the proposed algorithm.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127498246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Comprehension-based result snippets
Abhijith Kashyap, Vagelis Hristidis
Result snippets are used by most search interfaces to preview query results. Snippets help users quickly decide the relevance of the results, thereby reducing overall search time and effort. Most work on snippets has focused on text snippets for Web pages in Web search; little work has studied the problem of snippets for structured data, e.g., product catalogs. Furthermore, prior work has focused on the important goal of creating informative snippets but has ignored the amount of user effort required to comprehend, i.e., read and digest, the displayed snippets. In particular, it implicitly assumes that the comprehension effort or cost depends only on the length of the snippet, which we show is incorrect for structured data. We propose novel techniques to construct snippets of structured, heterogeneous results that not only select the most informative attributes for each result, but also minimize the expected user effort (time) to comprehend these snippets. We create a comprehension model to quantify the effort incurred by users in comprehending a list of result snippets; the model is supported by an extensive user study. A key observation is that the user effort for comprehending an attribute across multiple snippets depends only on the number of unique positions (e.g., indentations) where the attribute is displayed, not on the number of occurrences. We analyze the complexity of the snippet construction problem and show that it is NP-hard, even when we consider only the comprehension cost. We present efficient approximate algorithms and experimentally demonstrate their effectiveness and efficiency.
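The key observation about unique display positions translates directly into a cost function, sketched below; the unit cost constant and data layout are illustrative assumptions.

```python
def comprehension_cost(snippets, per_attribute_cost=1.0):
    """Cost of reading a list of snippets under the paper's key observation:
    an attribute costs effort once per *unique display position*, not once
    per occurrence. Each snippet is an ordered list of attribute names.
    (A sketch; the cost constant is a placeholder.)"""
    seen = set()              # (attribute, position) pairs already paid for
    for snippet in snippets:
        for position, attribute in enumerate(snippet):
            seen.add((attribute, position))
    return per_attribute_cost * len(seen)

# Two aligned snippets: 'price' and 'brand' occupy the same positions in
# both, so they are paid for once; only the differing third slot adds cost.
aligned   = [["price", "brand", "cpu"], ["price", "brand", "ram"]]
scattered = [["price", "brand", "cpu"], ["brand", "ram", "price"]]
print(comprehension_cost(aligned), comprehension_cost(scattered))  # 4.0 6.0
```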
{"title":"Comprehension-based result snippets","authors":"Abhijith Kashyap, Vagelis Hristidis","doi":"10.1145/2396761.2398405","DOIUrl":"https://doi.org/10.1145/2396761.2398405","url":null,"abstract":"Result snippets are used by most search interfaces to preview query results. Snippets help users quickly decide the relevance of the results, thereby reducing the overall search time and effort. Most work on snippets have focused on text snippets for Web pages in Web search. However, little work has studied the problem of snippets for structured data, e.g., product catalogs. Furthermore, all works have focused on the important goal of creating informative snippets, but have ignored the amount of user effort required to comprehend, i.e., read and digest, the displayed snippets. In particular, they implicitly assume that the comprehension effort or cost only depends on the length of the snippet, which we show is incorrect for structured data. We propose novel techniques to construct snippets of structured heterogeneous results, which not only select the most informative attributes for each result, but also minimize the expected user effort (time) to comprehend these snippets. We create a comprehension model to quantify the effort incurred by users in comprehending a list of result snippets. Our model is supported by an extensive user-study. A key observation is that the user effort for comprehending an attribute across multiple snippets only depends on the number of unique positions (e.g., indentations) where this attribute is displayed and not on the number of occurrences. We analyze the complexity of the snippet construction problem and show that the problem is NP-hard, even when we only consider the comprehension cost. We present efficient approximate algorithms, and experimentally demonstrate their effectiveness and efficiency.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"7 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124909255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Two-part segmentation of text documents
Deepak P, Karthik Venkat Ramanan, N. Wiratunga, Sadiq Sani
We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe events relating to a problem followed by events pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness, using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model a document as being generated starting with the problem part, whose words are sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in segmentation accuracy, and that this improved accuracy translates into improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise, and empirically illustrate that it is quite noise tolerant and degrades gracefully as noise increases.
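The generative story suggests a straightforward split-point search, sketched below with toy unigram tables. The mixture weight, smoothing constant, and uniform alignment over problem words are assumptions rather than the paper's exact estimation procedure.

```python
import math

def best_split(words, problem_lm, solution_lm, translation, alpha=0.5, eps=1e-9):
    """Choose the split point maximizing generative log-likelihood (a sketch).

    problem_lm / solution_lm: dicts word -> probability.
    translation: dict (solution_word, problem_word) -> probability, in the
    spirit of an IBM Model 1 table. alpha mixes the two solution sources."""
    def p_solution(w, problem_words):
        # Mixture: emit w from the solution LM, or translate it from a
        # uniformly chosen word already emitted in the problem part.
        t = sum(translation.get((w, pw), 0.0) for pw in problem_words)
        t /= max(len(problem_words), 1)
        return alpha * solution_lm.get(w, 0.0) + (1 - alpha) * t

    def loglik(k):
        score = sum(math.log(problem_lm.get(w, 0.0) + eps) for w in words[:k])
        score += sum(math.log(p_solution(w, words[:k]) + eps) for w in words[k:])
        return score

    return max(range(1, len(words)), key=loglik)

problem_lm  = {"printer": 0.4, "jams": 0.4, "paper": 0.2}
solution_lm = {"replaced": 0.5, "roller": 0.3, "works": 0.2}
translation = {("roller", "jams"): 0.6}     # toy IBM Model 1 entry
words = ["printer", "jams", "replaced", "roller", "works"]
print(best_split(words, problem_lm, solution_lm, translation))  # -> 2
```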
{"title":"Two-part segmentation of text documents","authors":"Deepak P, Karthik Venkat Ramanan, N. Wiratunga, Sadiq Sani","doi":"10.1145/2396761.2396862","DOIUrl":"https://doi.org/10.1145/2396761.2396862","url":null,"abstract":"We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the component two parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises of words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125508448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
A unified learning framework for auto face annotation by mining web facial images
Dayong Wang, S. Hoi, Ying He
Auto face annotation plays an important role in many real-world multimedia information and knowledge management systems. Recently there has been a surge of research interest in mining weakly-labeled facial images on the internet to tackle this long-standing research challenge in computer vision and image understanding. In this paper, we present a novel unified learning framework for face annotation by mining weakly labeled web facial images, combining sparse feature representation, content-based image retrieval, transductive learning, and inductive learning techniques. In particular, we first introduce a new search-based face annotation paradigm using transductive learning, then propose an effective inductive learning scheme for training classification-based annotators from weakly labeled facial images, and finally unify both transductive and inductive learning approaches to maximize the learning efficacy. We conduct extensive experiments on a real-world web facial image database, and the encouraging results show that the proposed unified learning scheme outperforms state-of-the-art approaches.
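The search-based (transductive) starting point can be sketched as retrieve-and-vote over weakly labeled neighbors. This is an illustration only; the paper's unified framework goes well beyond it, and the similarity measure and names here are assumptions.

```python
from collections import Counter

def annotate_face(query_feature, index, k=5):
    """Search-based annotation (a sketch): retrieve the k most similar
    weakly-labeled web faces and vote over their (noisy) name labels.
    index: list of (feature_vector, weak_label); cosine similarity assumed."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb + 1e-12)

    neighbours = sorted(index, key=lambda item: cosine(query_feature, item[0]),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

index = [([1.0, 0.0], "alice"), ([0.9, 0.1], "alice"), ([0.0, 1.0], "bob")]
print(annotate_face([0.8, 0.2], index, k=2))   # -> "alice"
```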
{"title":"A unified learning framework for auto face annotation by mining web facial images","authors":"Dayong Wang, S. Hoi, Ying He","doi":"10.1145/2396761.2398444","DOIUrl":"https://doi.org/10.1145/2396761.2398444","url":null,"abstract":"Auto face annotation plays an important role in many real-world multimedia information and knowledge management systems. Recently there is a surge of research interests in mining weakly-labeled facial images on the internet to tackle this long-standing research challenge in computer vision and image understanding. In this paper, we present a novel unified learning framework for face annotation by mining weakly labeled web facial images through interdisciplinary efforts of combining sparse feature representation, content-based image retrieval, transductive learning and inductive learning techniques. In particular, we first introduce a new search-based face annotation paradigm using transductive learning, and then propose an effective inductive learning scheme for training classification-based annotators from weakly labeled facial images, and finally unify both transductive and inductive learning approaches to maximize the learning efficacy. We conduct extensive experiments on a real-world web facial image database, in which encouraging results show that the proposed unified learning scheme outperforms the state-of-the-art approaches.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126827372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Information preservation in static index pruning
Ruey-Cheng Chen, Chia-Jung Lee, Chiung-min Tsai, J. Hsiang
We develop a new static index pruning criterion based on the notion of information preservation. The idea is motivated by the fact that degenerating a model, as static index pruning does, inevitably reduces the predictive power of the resulting model. We model this loss in predictive power using conditional entropy and show that the pruning decision can therefore be optimized to preserve as much information as possible. We evaluated the proposed approach on three different test corpora; the results show that our approach is comparable in retrieval performance to state-of-the-art methods. When efficiency is of concern, our method has some advantages over the reference methods and is therefore well suited to Web retrieval settings.
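The information-preservation criterion can be made concrete with a toy greedy pruner that drops the postings whose removal perturbs the conditional entropy H(D|T) least. This is an O(n^2) illustration under assumed definitions, not the paper's optimization.

```python
import math

def conditional_entropy(index):
    """H(D | T) over an inverted index {term: {doc: weight}} (a sketch)."""
    total = sum(w for plist in index.values() for w in plist.values())
    h = 0.0
    for plist in index.values():
        t_mass = sum(plist.values())
        for w in plist.values():
            h -= (w / total) * math.log2(w / t_mass)
    return h

def prune(index, budget):
    """Greedily drop postings whose removal changes H(D|T) least, until
    only `budget` postings remain (illustrative, quadratic-time variant)."""
    entries = [(t, d) for t, plist in index.items() for d in plist]
    while len(entries) > budget:
        base = conditional_entropy(index)
        def info_loss(entry):
            t, d = entry
            w = index[t].pop(d)            # tentatively remove the posting
            loss = abs(conditional_entropy(index) - base)
            index[t][d] = w                # restore it
            return loss
        victim = min(entries, key=info_loss)
        del index[victim[0]][victim[1]]
        entries.remove(victim)
    return index

index = {"apple": {1: 3.0, 2: 1.0}, "pear": {2: 2.0, 3: 2.0}, "plum": {3: 1.0}}
print(prune(index, budget=3))
```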
{"title":"Information preservation in static index pruning","authors":"Ruey-Cheng Chen, Chia-Jung Lee, Chiung-min Tsai, J. Hsiang","doi":"10.1145/2396761.2398673","DOIUrl":"https://doi.org/10.1145/2396761.2398673","url":null,"abstract":"We develop a new static index pruning criterion based on the notion of information preservation. This idea is motivated by the fact that model degeneration, as does static index pruning, inevitably reduces the predictive power of the resulting model. We model this loss in predictive power using conditional entropy and show that the decision in static index pruning can therefore be optimized to preserve information as much as possible. We evaluated the proposed approach on three different test corpora, and the result shows that our approach is comparable in retrieval performance to state-of-the-art methods. When efficiency is of concern, our method has some advantages over the reference methods and is therefore suggested in Web retrieval settings.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114892503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5