
Proceedings of the 22nd ACM international conference on Information & Knowledge Management: Latest Publications

Efficient two-party private blocking based on sorted nearest neighborhood clustering
Dinusha Vatsalan, P. Christen, Vassilios S. Verykios
Integrating data from diverse sources with the aim of identifying similar records that refer to the same real-world entities, without compromising the privacy of these entities, is an emerging research problem in various domains. This problem is known as privacy-preserving record linkage (PPRL). Scalability of PPRL is a main challenge due to the growing data sizes in real-world applications. Private blocking techniques have been used in PPRL to address this challenge by reducing the number of record pair comparisons that need to be conducted. Many of these private blocking techniques require a trusted third party to perform the blocking. One main threat with three-party solutions is collusion between parties to identify the private data of another party. We introduce a novel two-party private blocking technique for PPRL based on sorted nearest neighborhood clustering. Privacy is addressed by a combination of the privacy techniques k-anonymous clustering and public reference values. Experiments conducted on two real-world databases validate that our approach is scalable to large databases and effective in generating candidate record pairs that correspond to true matches, while preserving k-anonymous privacy characteristics. Our approach also performs as well as or better than three other state-of-the-art private blocking techniques in terms of scalability, blocking quality, and privacy, and it can achieve private blocking up to two orders of magnitude faster than other state-of-the-art private blocking approaches.
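The abstract describes the protocol only at a high level. As a loose illustration of the sorted-neighborhood idea it builds on (a sketch under our own assumptions, not the authors' protocol), the Python snippet below sorts one party's records by a blocking key, groups them into clusters of at least k records, and labels each cluster with a nearby public reference value; the records, reference values, and labeling heuristic are all hypothetical.

```python
# Illustrative sketch only: sorted neighborhood clustering with k-anonymous
# blocks labeled by public reference values. Not the authors' implementation;
# the data and the labeling heuristic are hypothetical.
import bisect

def k_anonymous_blocks(records, reference_values, k=3):
    """Sort records by their blocking key, split them into clusters of at
    least k records, and label each cluster with a public reference value,
    so only reference values (never record values) would be exchanged."""
    srt = sorted(records)
    refs = sorted(reference_values)
    blocks, current = [], []
    for rec in srt:
        current.append(rec)
        if len(current) >= k:
            # label the block with the reference value preceding its last record
            pos = bisect.bisect_right(refs, current[-1])
            blocks.append((refs[max(pos - 1, 0)], current))
            current = []
    if current and blocks:            # fold a short tail into the last block
        blocks[-1] = (blocks[-1][0], blocks[-1][1] + current)
    elif current:
        blocks.append((refs[0], current))
    return blocks

if __name__ == "__main__":
    party_a = ["johns", "jones", "jonson", "millar", "miller", "smith", "smyth"]
    public_refs = ["j", "m", "s"]     # hypothetical shared reference values
    for label, block in k_anonymous_blocks(party_a, public_refs, k=3):
        print(label, block)
```

Each block contains at least k records, which is what gives the revealed block labels their k-anonymity property.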
Citations: 35
Efficient parsing-based search over structured data
Aditya G. Parameswaran, R. Kaushik, A. Arasu
Parsing-based search, i.e., parsing keyword search queries using grammars, is often used to override the traditional "bag-of-words" semantics in web search and enterprise search scenarios. Compared to the "bag-of-words" semantics, the parsing-based semantics is richer and more customizable. While a formalism for parsing-based semantics for keyword search has been proposed in prior work and ad-hoc implementations exist, the problem of designing efficient algorithms to support these semantics is largely unstudied. In this paper, we present a suite of efficient algorithms and auxiliary indexes for this problem. Our algorithms work for a broad class of grammars used in practice and cover a variety of database matching functions (set- and substring-containment, approximate and exact equality) and scoring functions (to filter and rank different parses). We formally analyze the time complexity of our algorithms and provide an empirical evaluation over real-world data to show that our algorithms scale well with the size of the database and grammar.
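To make the parsing-based semantics concrete, here is a minimal Python sketch, not the paper's algorithms or indexes: a toy grammar whose terminals map to simple database matching functions is applied to a keyword query, and the resulting parses are ranked by a trivial scoring function. The grammar, data, and scoring are hypothetical.

```python
# Illustrative sketch only: parse a keyword query against a tiny grammar whose
# terminals are database matching functions, then rank the parses.
PRODUCTS = {"brand": {"canon", "nikon"}, "category": {"camera", "lens"}}

# terminal matchers: set containment against a column, or a numeric test
MATCHERS = {
    "Brand":    lambda tok: tok in PRODUCTS["brand"],
    "Category": lambda tok: tok in PRODUCTS["category"],
    "Price":    lambda tok: tok.isdigit(),
}

# grammar: Query -> Brand Category | Brand Category Price | Category
GRAMMAR = [["Brand", "Category"], ["Brand", "Category", "Price"], ["Category"]]

def parses(query):
    toks = query.lower().split()
    for rule in GRAMMAR:
        if len(rule) == len(toks) and all(MATCHERS[nt](t) for nt, t in zip(rule, toks)):
            # score a parse by its length, so more specific parses rank higher
            yield len(rule), list(zip(rule, toks))

if __name__ == "__main__":
    for score, parse in sorted(parses("canon camera 300"), reverse=True):
        print(score, parse)   # -> 3 [('Brand', 'canon'), ('Category', 'camera'), ('Price', '300')]
```

A real system would enumerate many partial parses over large grammars, which is exactly where the paper's indexes and algorithms come in.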
Citations: 1
Location prediction in social media based on tie strength
Jeffrey McGee, James Caverlee, Zhiyuan Cheng
We propose a novel network-based approach for location estimation in social media that integrates evidence of the social tie strength between users for improved location estimation. Concretely, we propose a location estimator -- FriendlyLocation -- that leverages the relationship between the strength of the tie between a pair of users, and the distance between the pair. Based on an examination of over 100 million geo-encoded tweets and 73 million Twitter user profiles, we identify several factors such as the number of followers and how the users interact that can strongly reveal the distance between a pair of users. We use these factors to train a decision tree to distinguish between pairs of users who are likely to live nearby and pairs of users who are likely to live in different areas. We use the results of this decision tree as the input to a maximum likelihood estimator to predict a user's location. We find that this proposed method significantly improves the results of location estimation relative to a state-of-the-art technique. Our system reduces the average error distance for 80% of Twitter users from 40 miles to 21 miles using only information from the user's friends and friends-of-friends, which has great significance for augmenting traditional social media and enriching location-based services with more refined and accurate location estimates.
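As a rough sketch of the two-stage idea described above (not the FriendlyLocation system itself), the example below trains a decision tree on hypothetical tie-strength features to estimate how likely each friend is to live near the user, then picks the candidate city maximizing the resulting log-likelihood. The features, training pairs, and cities are invented for illustration, and scikit-learn is assumed to be available.

```python
# Illustrative sketch only: tie-strength features -> decision tree -> maximum
# likelihood location estimate. All data are hypothetical.
import math
from sklearn.tree import DecisionTreeClassifier

# training pairs: [log(#followers of the friend), #mentions between the pair]
X_train = [[2.0, 5], [6.5, 0], [3.0, 3], [7.0, 1], [2.5, 8], [5.5, 0]]
y_train = [1, 0, 1, 0, 1, 0]                 # 1 = the pair lives nearby
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# friends of the target user: (home city, tie-strength features)
friends = [("Austin", [2.2, 6]), ("Austin", [3.1, 2]), ("Boston", [6.8, 0])]

def log_likelihood(city):
    """Sum log-probabilities of the observed friend locations if the user lived in `city`."""
    ll = 0.0
    for home, feats in friends:
        p_near = tree.predict_proba([feats])[0][1]
        p_near = min(max(p_near, 1e-6), 1 - 1e-6)    # avoid log(0)
        ll += math.log(p_near if home == city else 1 - p_near)
    return ll

candidates = {home for home, _ in friends}
print(max(candidates, key=log_likelihood))            # estimated home city
```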
Citations: 160
Query matching for report recommendation
Veronika Thost, Konrad Voigt, Daniel Schuster
Today, reporting is an essential part of everyday business life. But preparing complex Business Intelligence data by formulating relevant queries and presenting them in meaningful visualizations, so-called reports, is a challenging task for non-expert database users. To support these users with report creation, we leverage existing queries and present a system for query recommendation in a reporting environment, based on query matching. Targeting large-scale, real-world reporting scenarios, we propose a scalable, index-based query matching approach. Moreover, schema matching is applied for a more fine-grained, structural comparison of the queries. In addition to interactively providing content-based query recommendations of good quality, the system works independently of particular data sources and query languages. We evaluate our system on an empirical data set and show that it achieves an F1-measure of 0.56 and outperforms the approaches applied by state-of-the-art reporting tools (e.g., keyword search) by up to 30%.
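One simple way to picture index-based query matching (a sketch under our own assumptions, not the paper's system) is to index stored report queries by the schema elements they reference and rank candidates by their overlap with a newly issued query:

```python
# Illustrative sketch only: recommend stored report queries by schema-element
# overlap with a new query. The stored queries are hypothetical.
from collections import defaultdict

stored_queries = {
    "monthly_sales": {"sales.amount", "sales.month", "product.name"},
    "top_customers": {"customer.name", "sales.amount"},
    "stock_levels":  {"product.name", "warehouse.stock"},
}

# inverted index: schema element -> stored queries that reference it
index = defaultdict(set)
for qid, elements in stored_queries.items():
    for el in elements:
        index[el].add(qid)

def recommend(new_query_elements, top_k=2):
    """Rank index candidates by Jaccard overlap of referenced schema elements."""
    candidates = set().union(*(index[el] for el in new_query_elements if el in index))
    def jaccard(qid):
        s = stored_queries[qid]
        return len(s & new_query_elements) / len(s | new_query_elements)
    return sorted(candidates, key=jaccard, reverse=True)[:top_k]

print(recommend({"sales.amount", "product.name"}))    # 'monthly_sales' ranks first
```

The paper additionally applies schema matching for a finer structural comparison; the set overlap above is only the coarsest stand-in for that step.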
Citations: 1
PAGE: a partition aware graph computation engine
Yingxia Shao, Junjie Yao, B. Cui, Lin Ma
Graph partitioning is one of the key components of parallel graph computation, and the partition quality significantly affects the overall computing performance. In existing graph computing systems, "good" partition schemes are preferred as they have a smaller edge cut ratio and hence reduce the communication cost among working nodes. However, in an empirical study on Giraph [1], we found that the performance over a well-partitioned graph might be even two times worse than over simple partitions. The cause is that the local message processing cost in graph computing systems may surpass the communication cost in several cases. In this paper, we analyse the cost of parallel graph computing systems as well as the relationship between this cost and the underlying graph partitioning. Based on these observations, we propose a novel Partition Aware Graph computation Engine named PAGE. PAGE is equipped with two newly designed modules, i.e., a communication module with a dual concurrent message processor, and a partition-aware module that monitors the system's status. The monitored information is used to dynamically adjust the concurrency of the dual concurrent message processor with a novel Dynamic Concurrency Control Model (DCCM). The DCCM applies several heuristic rules to determine the optimal concurrency for the message processor. We have implemented a prototype of PAGE and conducted extensive studies on a moderately sized cluster. The experimental results clearly demonstrate PAGE's robustness under different graph partition qualities and show its advantages over existing systems, with up to 59% improvement.
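As one plausible reading of what a dynamic concurrency adjustment could look like (a sketch, not PAGE's actual DCCM or its heuristic rules), the snippet below splits a fixed pool of message-processing workers between local and remote units in proportion to their observed load; the arrival rates and per-message costs are hypothetical monitoring values.

```python
# Illustrative sketch only: divide workers between the local and remote message
# processing units according to monitored load. Not PAGE's DCCM.

def allocate_workers(total_workers, local_rate, remote_rate,
                     local_cost=1.0, remote_cost=1.5):
    """Give each unit a share of workers proportional to
    (message arrival rate x per-message cost), keeping at least one worker each."""
    local_load = local_rate * local_cost
    remote_load = remote_rate * remote_cost
    local_workers = round(total_workers * local_load / (local_load + remote_load))
    local_workers = min(max(local_workers, 1), total_workers - 1)
    return local_workers, total_workers - local_workers

# a well-partitioned graph produces mostly local messages, so most workers go
# to the local unit; a poorly partitioned graph shifts them to the remote unit
print(allocate_workers(8, local_rate=9000, remote_rate=1000))   # -> (7, 1)
print(allocate_workers(8, local_rate=2000, remote_rate=8000))   # -> (1, 7)
```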
Citations: 11
An analysis of crowd workers mistakes for specific and complex relevance assessment task
J. Anderton, Maryam Bashir, Virgil Pavlu, J. Aslam
The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments with the goal of replicating costly expert judgements with relatively fast, inexpensive, but less reliable judgements from anonymous online workers. The track used 10 "ad-hoc" queries, highly specific and complex (as compared to web search). The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets nowhere close to the expert judgements, we decided to analyze crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for nominal relevance grades for each document, and the other that asks for preferences on many (not all) pairs of documents.
Citations: 8
Mining diabetes complication and treatment patterns for clinical decision support
Lu Liu, Jie Tang, Yu Cheng, Ankit Agrawal, W. Liao, A. Choudhary
The fast development of hospital information systems (HIS) produces a large volume of electronic medical records, which provide a comprehensive source for exploratory analysis and statistics to support clinical decision-making. In this paper, we investigate how to utilize heterogeneous medical records to aid the clinical treatment of diabetes mellitus. Diabetes mellitus, or simply diabetes, is a group of metabolic diseases that is often accompanied by many complications. We propose a Symptom-Diagnosis-Treatment model to mine diabetes complication patterns and to unveil the latent association mechanism between treatments and symptoms from a large volume of electronic medical records. Furthermore, we study the demographic statistics of the patient population with respect to complication patterns in real data and observe several interesting phenomena. The discovered complication and treatment patterns can help physicians better understand their specialty and learn from previous experience. Our experiments on a collection of one year of diabetes clinical records from a well-known geriatric hospital demonstrate the effectiveness of our approaches.
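The abstract does not spell out the Symptom-Diagnosis-Treatment model, so the sketch below is only a loose illustration of complication-pattern mining in general, not the authors' method: it counts co-occurring complications across a handful of hypothetical patient records and reports the pairs above a minimum support.

```python
# Loose illustration only: frequent co-occurrence of complications in
# hypothetical patient records. Not the Symptom-Diagnosis-Treatment model.
from collections import Counter
from itertools import combinations

records = [
    {"retinopathy", "nephropathy"},
    {"retinopathy", "neuropathy", "nephropathy"},
    {"neuropathy"},
    {"retinopathy", "nephropathy"},
]

pair_counts = Counter()
for complications in records:
    for pair in combinations(sorted(complications), 2):
        pair_counts[pair] += 1

min_support = 2
for pair, count in pair_counts.most_common():
    if count >= min_support:
        print(pair, f"support={count}/{len(records)}")   # frequent complication pair
```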
Citations: 18
DeExcelerator: a framework for extracting relational data from partially structured documents
Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, Wolfgang Lehner
Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
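As a minimal sketch of the kind of normalization such a framework must perform (our own toy heuristic, not DeExcelerator's extraction pipeline), the example below takes a small hypothetical sheet containing a title row and blank rows, guesses the header row, and emits relational tuples:

```python
# Illustrative sketch only: turn a partially structured sheet (title row, blank
# rows, header, data) into relational tuples. The sheet content is hypothetical.

sheet = [
    ["Population report 2013", "", ""],
    ["", "", ""],
    ["Country", "Year", "Population"],
    ["Germany", "2013", "80.6"],
    ["France",  "2013", "65.9"],
]

def to_relation(rows):
    """Heuristic: the header is the first row whose cells are all non-empty,
    non-numeric strings; every non-empty row below it is a data tuple."""
    def is_header(row):
        return all(c and not c.replace(".", "").isdigit() for c in row)
    for i, row in enumerate(rows):
        if is_header(row) and i + 1 < len(rows):
            data = [r for r in rows[i + 1:] if any(r)]
            return [dict(zip(row, r)) for r in data]
    return []

for tup in to_relation(sheet):
    print(tup)   # {'Country': 'Germany', 'Year': '2013', 'Population': '80.6'} ...
```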
Citations: 35
The first workshop on user engagement optimization
Liangjie Hong, Shuang-Hong Yang
Online user engagement optimization is key to many Internet businesses. Several research areas are related to the concept of online user engagement optimization, including machine learning, data mining, information retrieval, recommender systems, online A/B (bucket) testing, and psychology. In the past, research efforts in this direction have been pursued in separate communities and conferences, yielding potentially disconnected and duplicated results. In addition, researchers and practitioners are sometimes only exposed to a specific aspect of the topic, which can give an incomplete and suboptimal view of the whole picture. Here, we organize the first workshop on the topic of online user engagement optimization, explicitly targeting the topic as a whole and bringing researchers and practitioners together to foster the field. We invite two leading researchers from industry to give keynote talks about online machine learning and online experimentation. In addition, several invited talks from industry and academic researchers cover the topics of content personalization, online experimental platforms, and recommender systems. Also, six novel submissions are included as short papers, so that new results can be discussed and shared at the workshop.
Citations: 1
Efficient pruning algorithm for top-K ranking on dataset with value uncertainty
Jianwen Chen, Ling Feng
A top-K ranking query in an uncertain database aims to find the top-K tuples according to a ranking function. The interplay between score and uncertainty makes top-K ranking in uncertain databases an intriguing issue, leading to rich query semantics. Recently, a unified ranking framework based on parameterized ranking functions (PRFs) has been formulated, which generalizes many previously proposed ranking semantics. Under the PRF-based ranking framework, efficient pruning approaches for top-K ranking on datasets with tuple uncertainty have been well studied in the literature. However, these cannot be applied to top-K ranking on datasets with value uncertainty (described through the attribute-level uncertain data model), which is often natural and useful for analyzing uncertain data in many applications. This paper aims to develop efficient pruning techniques for top-K ranking on datasets with value uncertainty under the PRF-based ranking framework, a setting that has not been well studied in the literature. We present the mathematics of deriving the pruning techniques and the corresponding algorithms. Experimental results on both real and synthetic data demonstrate the effectiveness and efficiency of the proposed pruning techniques.
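For context, a parameterized ranking function scores each tuple by weighting the probability that it appears at each rank; the general form below follows the PRF literature rather than this paper's specific notation, and the symbols are our own:

```latex
% General PRF form (notation ours, not necessarily this paper's):
% \omega(t,i) is a user-chosen weight for tuple t appearing at rank i, and
% Pr[r(t)=i] is the probability, over possible worlds, that t is ranked i-th.
\Upsilon^{\omega}(t) \;=\; \sum_{i \ge 1} \omega(t, i)\,\Pr\!\left[ r(t) = i \right]
```

A top-K query then returns the K tuples with the largest Υ values; setting ω(t,i)=1 for i ≤ K and 0 otherwise, for example, scores each tuple by its probability of appearing in the top K, and other weight choices recover other previously proposed semantics. Pruning, in this setting, aims to discard tuples that cannot enter the top K without evaluating this sum exactly for every tuple.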
Citations: 4