
Proceedings of the 21st ACM international conference on Information and knowledge management: latest papers

Improving document clustering using automated machine translation
Xiang Wang, B. Qian, I. Davidson
With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language to many other languages. These translations provide different yet correlated views of the same set of documents. This gives rise to an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering has provided positive answers to this question. In this work, we propose an alternative approach to address this problem using the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and can consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique does not need the compatibility or the conditional independence assumption, nor does it involve subtle parameter tuning.
DOI: https://doi.org/10.1145/2396761.2396844
Published: 2012-10-29
Citations: 15
MOUNA: mining opinions to unveil neglected arguments
Mouna Kacimi, J. Gamper
A query topic can be subjective, involving a variety of opinions, judgments, arguments, and many other debatable aspects. Typically, search engines process queries independently of the nature of their topics, using a relevance-based retrieval strategy. Hence, search results about subjective topics are often biased towards a specific viewpoint or version. In this demo, we present MOUNA, a novel approach for opinion diversification. Given a query on a subjective topic, MOUNA ranks search results based on three scores: (1) relevance of documents, (2) semantic diversity to avoid redundancy and capture the different arguments used to discuss the query topic, and (3) sentiment diversity to cover a balanced set of documents having positive, negative, and neutral sentiments about the query topic. Moreover, MOUNA enhances the representation of search results with a summary of the different arguments and sentiments related to the query topic. Thus, the user can navigate through the results and explore the links between them. We provide an example scenario in this demonstration to illustrate the inadequacy of relevance-based techniques for searching subjective topics and to highlight the innovative aspects of MOUNA. A video showing the demo can be found at http://www.youtube.com/user/mounakacimi/videos.
DOI: https://doi.org/10.1145/2396761.2398739
Published: 2012-10-29
Citations: 8
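The MOUNA abstract names three scores (relevance, semantic diversity, sentiment diversity) but does not say how they are combined. The sketch below is one possible reading, not the authors' method: a greedy re-ranker that adds the three scores with equal weights. The function names, the weights, the Jaccard-based novelty measure, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def diversify(docs, k, w_rel=1.0, w_sem=1.0, w_sent=1.0):
    """Greedily pick k documents by relevance + semantic novelty + sentiment balance.

    Each doc is a dict with 'id', 'relevance' (float), 'terms' (set of str)
    and 'sentiment' ('positive', 'negative' or 'neutral'). Hypothetical schema.
    """
    selected = []
    remaining = list(docs)
    sentiment_counts = Counter()

    def score(d):
        # Semantic novelty: 1 minus the highest similarity to any chosen doc.
        sem = 1.0 - max((jaccard(d["terms"], s["terms"]) for s in selected),
                        default=0.0)
        # Sentiment balance: favor sentiments that are under-represented so far.
        sent = 1.0 / (1 + sentiment_counts[d["sentiment"]])
        return w_rel * d["relevance"] + w_sem * sem + w_sent * sent

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        sentiment_counts[best["sentiment"]] += 1
    return [d["id"] for d in selected]


# Toy data (invented): 'a' and 'b' make the same argument with the same sentiment.
DOCS = [
    {"id": "a", "relevance": 0.90, "terms": {"tax", "cut"}, "sentiment": "positive"},
    {"id": "b", "relevance": 0.85, "terms": {"tax", "cut"}, "sentiment": "positive"},
    {"id": "c", "relevance": 0.60, "terms": {"health", "cost"}, "sentiment": "negative"},
    {"id": "d", "relevance": 0.50, "terms": {"jobs"}, "sentiment": "neutral"},
]

ranking = diversify(DOCS, k=3)
print(ranking)  # ['a', 'c', 'd']
```

With the two diversity weights set to zero the same function falls back to plain relevance ranking (`a`, `b`, `c`), which shows the redundancy that diversification is meant to avoid.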
Concavity in IR models
S. Clinchant
We study the impact of concavity in IR models and propose to use a generalized logarithm function, the n-logarithm, to weight words in documents. We extend the family of information-based Information Retrieval (IR) models with this function. We show that concavity is indeed an important property of IR models. Experiments conducted for IR tasks, Latent Semantic Indexing, and Text Categorization show improvements.
DOI: https://doi.org/10.1145/2396761.2398686
Published: 2012-10-29
Citations: 2
Indexing uncertain spatio-temporal data
Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Andreas Züfle
Advances in sensing and telecommunication technologies allow the collection and management of vast amounts of spatio-temporal data combining location and time information. Due to physical and resource limitations of data collection devices (e.g., RFID readers, GPS receivers and other sensors), data are typically collected only at discrete points in time. In between these discrete time instances, the positions of tracked moving objects are uncertain. In this work, we propose novel approximation techniques to probabilistically bound the uncertain movement of objects; these techniques allow for efficient and effective filtering during query evaluation using a hierarchical index structure. To the best of our knowledge, this is the first approach that supports query evaluation on very large uncertain spatio-temporal databases while adhering to possible worlds semantics. We experimentally show that it accelerates the existing, scan-based approach by orders of magnitude.
DOI: https://doi.org/10.1145/2396761.2396813
Published: 2012-10-29
Citations: 32
A tag-centric discriminative model for web objects classification
Lina Yao, Quan Z. Sheng
This paper studies the web object classification problem through a novel exploration of social tags. Web objects are increasingly annotated with human-interpretable labels (i.e., tags), which can be considered an auxiliary attribute to assist object classification. Automatically classifying web objects into manageable semantic categories has long been a fundamental pre-process for indexing, browsing, searching, and mining heterogeneous web objects. However, such heterogeneous web objects often suffer from a lack of easily extractable and uniform descriptive features. In this paper, we propose a discriminative tag-centric model for web object classification by jointly modeling the objects' category labels and their corresponding social tags and un-coding the relevance among social tags. Our approach is based on recent techniques for learning large-scale discriminative models. We conduct experiments to validate our approach using real-life data. The results show the feasibility and good performance of our approach.
DOI: https://doi.org/10.1145/2396761.2398612
Published: 2012-10-29
Citations: 1
Unsupervised discovery of opposing opinion networks from forum discussions
Yue Lu, Hongning Wang, ChengXiang Zhai, D. Roth
With more and more people freely expressing opinions and actively interacting with each other in discussion threads, online forums are becoming a gold mine of rich information about people's opinions and social behaviors. In this paper, we study an interesting new problem: automatically discovering opposing opinion networks of users from forum discussions, that is, subsets of users who are strongly against each other on some topic. Toward this goal, we propose to use signals from both textual content (e.g., who says what) and social interactions (e.g., who talks to whom), both of which are abundant in online forums. We also design an optimization formulation to combine all the signals in an unsupervised way. We created a data set by manually annotating forum data on five controversial topics. Our experimental results show that the proposed optimization method outperforms several baselines and existing approaches, demonstrating the power of combining text analysis and social network analysis in analyzing and generating the opposing opinion networks.
DOI: https://doi.org/10.1145/2396761.2398489
Published: 2012-10-29
Citations: 36
CGStream: continuous correlated graph query for data streams
Shirui Pan, Xingquan Zhu
In this paper, we propose to query correlated graphs in a data stream scenario, where, given a query graph q, an algorithm is required to retrieve all the subgraphs whose Pearson's correlation coefficients with q are greater than a threshold Θ over graph data arriving as a stream. Due to the dynamically changing nature of the stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. We propose a novel algorithm, CGStream, to identify correlated graphs from a data stream by using a sliding window that covers a number of consecutive batches of stream data records. Our approach is to regard a stream query as a traversal along the data stream, with the query answered at a number of outlooks over the stream. For each outlook, we derive a lower frequency bound to mine a set of frequent subgraph candidates, where the lower bound guarantees that no pattern is missed between the current outlook and the next outlook. On top of that, we derive an upper correlation bound and a heuristic rule to prune the candidate set, which helps reduce the computation cost at each outlook. Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithm. Meanwhile, our algorithm achieves good performance in terms of query precision.
DOI: https://doi.org/10.1145/2396761.2398419
Published: 2012-10-29
Citations: 10
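To make the CGStream query condition concrete, the sketch below recomputes Pearson's correlation over a sliding window of batches by brute force. It is an illustration only and leaves out the frequency and correlation bounds that the abstract describes. It assumes each stream record has already been reduced to the set of pattern identifiers it contains (subgraph matching is out of scope here), so occurrences are binary and Pearson's correlation reduces to the phi coefficient. The class name, method names, and toy batches are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(records, q, g):
    """Pearson (phi) correlation between the occurrences of patterns q and g.

    records: list of sets, each holding the pattern ids found in one record.
    """
    n = len(records)
    if n == 0:
        return 0.0
    supp_q = sum(1 for r in records if q in r) / n
    supp_g = sum(1 for r in records if g in r) / n
    supp_qg = sum(1 for r in records if q in r and g in r) / n
    denom = sqrt(supp_q * supp_g * (1 - supp_q) * (1 - supp_g))
    if denom == 0:
        return 0.0  # one pattern occurs in all records or in none
    return (supp_qg - supp_q * supp_g) / denom


class SlidingWindow:
    """Keeps the most recent max_batches batches of stream records."""

    def __init__(self, max_batches):
        self.batches = deque(maxlen=max_batches)  # oldest batch drops out

    def push(self, batch):
        self.batches.append(batch)

    def correlated(self, q, candidates, theta):
        """Return the candidates whose correlation with q exceeds theta."""
        records = [r for batch in self.batches for r in batch]
        return sorted(g for g in candidates if pearson(records, q, g) > theta)


window = SlidingWindow(max_batches=2)
window.push([{"q", "g1"}, {"q", "g1"}, {"g2"}, set()])
window.push([{"q", "g1", "g2"}, {"g2"}, {"q"}, set()])
first = window.correlated("q", ["g1", "g2"], theta=0.4)

# A third batch evicts the first one, and the answer changes.
window.push([{"q", "g2"}, {"q", "g2"}, {"g1"}, set()])
second = window.correlated("q", ["g1", "g2"], theta=0.4)
print(first, second)  # ['g1'] ['g2']
```

Recomputing every candidate from scratch at each step is the cost that pruning is meant to avoid.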
What is the IQ of your data transformation system?
G. Mecca, Paolo Papotti, Salvatore Raunich, Donatello Santoro
Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: "what is the right tool for my translation task?" In this paper, we introduce several techniques that contribute to answering this question. Among these are a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user effort. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness, and, ultimately, about their "intelligence".
DOI: https://doi.org/10.1145/2396761.2396872
Published: 2012-10-29
Citations: 17
From sBoW to dCoT: marginalized encoders for text representation
Z. Xu, Minmin Chen, Kilian Q. Weinberger, Fei Sha
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW-style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high-dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved in closed form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.
DOI: https://doi.org/10.1145/2396761.2398536
Published: 2012-10-29
Citations: 26
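The dCoT abstract states that the random word removal can be marginalized out. As a small illustration of what that can mean, the sketch below assumes each word is removed (set to zero) independently with probability p. The abstract does not give the corruption model, so this is an assumption. Under it, the expected scatter matrix of the corrupted documents has a closed form: off-diagonal entries are scaled by (1 - p)^2 and diagonal entries by (1 - p), so no corrupted copies need to be sampled. The function name and toy data are hypothetical, and this is one ingredient of such a derivation, not the dCoT algorithm itself.

```python
def expected_corrupted_scatter(X, p):
    """Expected value of sum_x (x~ x~^T) when each feature of x is zeroed
    independently with probability p (assumed corruption model).

    X: list of documents, each a list of word counts of equal length.
    """
    d = len(X[0])
    keep = 1.0 - p
    # Uncorrupted scatter matrix S = sum over documents of x x^T.
    S = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    # A diagonal entry survives if one feature is kept (probability keep);
    # an off-diagonal entry needs both features kept (probability keep**2).
    return [[S[i][j] * (keep if i == j else keep * keep) for j in range(d)]
            for i in range(d)]


X = [[1, 0, 2],
     [0, 1, 1]]
Q = expected_corrupted_scatter(X, p=0.5)
print(Q)  # [[0.5, 0.0, 0.5], [0.0, 0.5, 0.25], [0.5, 0.25, 2.5]]
```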
CrowdTiles: presenting crowd-based information for event-driven information needs
S. Whiting, K. Zhou, J. Jose, Omar Alonso, Teerapong Leelanupab
Time plays a central role in many web search information needs relating to recent events. For recency queries where fresh information is most desirable, there is likely to be a great deal of highly relevant information created very recently by crowds of people across the world, particularly on platforms such as Wikipedia and Twitter. With so many users, mainstream events are often very quickly reflected in these sources. The English Wikipedia encyclopedia consists of a vast collection of user-edited articles covering a range of topics. During events, users collaboratively create new articles and edit existing ones in near real-time. Simultaneously, users on Twitter disseminate and discuss event details, with a small number of users becoming influential for the topic. In this demo, we propose a novel approach to presenting a summary of new information and users related to recent or ongoing events associated with the user's search topic, thereby aiding discovery of the most recent information. We outline methods to detect search topics that are driven by events, identify and extract changing Wikipedia article passages, and find influential Twitter users. Using these, we provide a system that displays familiar tiles in search results to present recent changes in the event-related Wikipedia articles, as well as Twitter users who have tweeted recent relevant information about the event topics.
DOI: 10.1145/2396761.2398731 (https://doi.org/10.1145/2396761.2398731)
Published in: Proceedings of the 21st ACM international conference on Information and knowledge management, 2012-10-29
Citations: 11
Journal
Proceedings of the 21st ACM international conference on Information and knowledge management