
Advances in database technology : proceedings. International Conference on Extending Database Technology: Latest Articles

A Supervised Skyline-Based Algorithm for Spatial Entity Linkage
Suela Isaj, Vassilis Kaffes, T. Pedersen, G. Giannopoulos
The ease of publishing data on the web has contributed to larger and more diverse types of data. Entities that refer to a physical place and are characterized by a location and different attributes are named spatial entities. Even though the amount of spatial entity data from multiple sources keeps increasing, facilitating the development of richer, more accurate and more comprehensive geospatial applications and services, there is unavoidable redundancy and ambiguity. We address the problem of spatial entity linkage with SkylineExplore-Trained (SkyEx-T), a skyline-based algorithm that can label an entity pair as being the same physical entity or not. We introduce LinkGeoML-eXtended (LGM-X), a meta-similarity function that computes similarity features specifically tailored to the characteristics of spatial entities. The skylines of SkyEx-T are created using a preference function, which ranks the pairs based on the likelihood of referring to the same entity. We propose deriving the preference function using a tiny training set (down to 0.05% of the dataset). Additionally, we provide a theoretical guarantee for the cut-off that can best separate the classes, and we show experimentally that it results in a near-optimal F-measure (on average only 2% loss). SkyEx-T yields an F-measure of 0.71-0.74 and beats the existing non-skyline-based baselines by a margin of 0.11-0.39 in F-measure. Compared to machine learning techniques, SkyEx-T achieves similar accuracy (sometimes slightly better on very small training sets) and, more importantly, has no parameters to tune and yields a model that is already explainable (no further actions are needed to achieve explainability).
DOI: https://doi.org/10.48786/edbt.2022.11. Pages: 2:220-2:233. Published: 2022-01-01.
Citations: 0
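The skyline idea in the SkyEx-T abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in plain Pareto dominance for the learned preference function, and the two similarity features and their scores are hypothetical. It only shows how candidate pairs can be peeled into successive skylines, so that a cut-off on the layer index could separate likely matches from non-matches.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as similar as b on every feature and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_layers(points: List[Sequence[float]]) -> List[List[int]]:
    """Peel successive skylines; layer 0 holds the indices of the non-dominated pairs."""
    remaining = list(range(len(points)))
    layers: List[List[int]] = []
    while remaining:
        # A pair belongs to the current skyline if no other remaining pair dominates it.
        layer = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        layers.append(layer)
        remaining = [i for i in remaining if i not in layer]
    return layers


# Hypothetical (name_similarity, address_similarity) scores for four candidate pairs.
pairs = [(0.9, 0.8), (0.7, 0.9), (0.6, 0.6), (0.2, 0.1)]
layers = skyline_layers(pairs)
print(layers)
```

Under these assumptions, labeling the first few layers as matches and the rest as non-matches plays the role of the cut-off mentioned in the abstract; how SkyEx-T actually derives that cut-off is not described here.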
Gamma Probabilistic Databases: Learning from Exchangeable Query-Answers
Niccolò Meneghetti, Ouael Ben Amara
In this paper we propose a novel knowledge compilation technique that compiles Bayesian inference procedures, starting from probabilistic programs expressed in terms of probabilistic query-answers. To do so, we extend the framework of Dirichlet Probabilistic Databases with the ability to process exchangeable observations of query-answers. We show that the resulting framework can encode non-trivial models, like Latent Dirichlet Allocation and the Ising model, and generate high-performance Gibbs samplers for both models.
DOI: https://doi.org/10.48786/edbt.2022.14. Pages: 2:260-2:273. Published: 2022-01-01.
Citations: 0
Conceptual models and databases for searching the genome
Anna Bernasconi, Pietro Pinoli
Genomics is an extremely complex domain, in terms of concepts, their relations, and their representations in data. This tutorial introduces the use of ER models in the context of genomic systems: conceptual models are of great help for simplifying this domain and making it actionable. We carry out a review of successful models presented in the literature for representing biologically-relevant entities and grounding them in databases. We distinguish between conceptual models that aim to explain the domain and conceptual models that aim to support database design and heterogeneous data integration. Genomic experiments and/or sequences are described by metadata, which specify information on the sampled organism, the technology used, and the organizational process behind the experiment. By contrast, we call data the actual regions of the genome that have been read by sequencing technologies and encoded into a machine-readable representation. First, we show how data and metadata can be modeled, then we exploit the proposed models for designing search systems, visualizers, and analysis environments. Both human genomics and viral genomics are addressed, surveying several use cases and applications of broader public interest. The tutorial is relevant to the EDBT community because it demonstrates the usefulness of conceptual models’ principles within very current domains; in addition, it offers a concrete example of conceptual models’ use, setting the premises for interdisciplinary collaboration with a wider public (possibly including life science researchers).
DOI: https://doi.org/10.48786/edbt.2022.57. Pages: 1-4. Published: 2022-01-01.
Citations: 0
Implementing Distributed Similarity Joins using Locality Sensitive Hashing
Martin Aumüller, Matteo Ceccarello
DOI: https://doi.org/10.5441/002/edbt.2022.07. Pages: 1:78-1:90. Published: 2022-01-01.
Citations: 1
A Neural Approach to Forming Coherent Teams in Collaboration Networks
Radin Hamidi Rad, Shirin Seyedsalehi, M. Kargar, Morteza Zihayat, E. Bagheri
We study team formation, whose goal is to form a team of experts who collectively cover a set of desirable skills. This problem has mainly been addressed either through graph search techniques, which look for subgraphs that satisfy a set of skill requirements, or through neural architectures that learn a mapping from the skill space to the expert space. An exact graph-based solution to this problem is intractable, and its heuristic variants are only able to identify sub-optimal solutions. On the other hand, neural architecture-based solutions treat experts individually without concern for team dynamics. In this paper, we address the task of forming coherent teams and propose a neural approach that maximizes the likelihood of successful collaboration among team members while maximizing the team's coverage of the required skills. Our extensive experiments show that the proposed approach outperforms the state-of-the-art methods in terms of both ranking and quality metrics.
DOI: https://doi.org/10.48786/edbt.2022.37. Pages: 2:440-2:444. Published: 2022-01-01.
Citations: 4
Voyager: Data Discovery and Integration for Onboarding in Data Science
Alex Bogatu, N. Paton, Mark Douthwaite, A. Freitas
DOI: https://doi.org/10.48786/edbt.2022.47. Pages: 2:537-2:548. Published: 2022-01-01.
Citations: 2
Evaluation of Algorithms for Interaction-Sparse Recommendations: Neural Networks don't Always Win
Yasamin Klingler, Claude Lehmann, J. Monteiro, Carlo Saladin, A. Bernstein, Kurt Stockinger
In recent years, top-K recommender systems with implicit feedback data have gained interest in many real-world business scenarios. In particular, neural networks have shown promising results on these tasks. However, while traditional recommender systems are built on datasets with frequent user interactions, insurance recommenders often have access to a very limited amount of user interactions, as people only buy a few insurance products. In this paper, we shed new light on the problem of top-K recommendations for interaction-sparse recommender problems. In particular, we analyze six different recommender algorithms: a popularity-based baseline, which we compare against two matrix factorization methods (SVD++, ALS), one neural network approach (JCA) and two combinations of neural network and factorization machine approaches (DeepFM, NeuFM). We evaluate these algorithms on six different interaction-sparse datasets and one dataset with a less sparse interaction pattern to elucidate the unique behavior of interaction-sparse datasets. In our experimental evaluation based on real-world insurance data, we demonstrate that DeepFM shows the best performance, followed by JCA and SVD++, which indicates that neural network approaches are the dominant technologies. However, for the remaining five datasets we observe a different pattern. Overall, the matrix factorization method SVD++ is the winner. Surprisingly, the simple popularity-based approach comes out second, followed by the neural network approach JCA. In summary, our experimental evaluation for interaction-sparse datasets demonstrates that, in general, matrix factorization methods outperform neural network approaches. As a consequence, traditional well-established methods should be part of the portfolio of algorithms to solve real-world interaction-sparse recommender problems.
DOI: https://doi.org/10.48786/edbt.2022.42. Pages: 2:475-2:486. Published: 2022-01-01.
Citations: 2
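To make the popularity-based baseline from the abstract above concrete, here is a minimal sketch. The function name, the data layout (a user-to-items mapping of implicit feedback) and the toy insurance products are assumptions for illustration; the paper's exact baseline may differ, for example in how ties are broken.

```python
from collections import Counter
from typing import Dict, List, Set


def popularity_top_k(interactions: Dict[str, Set[str]], user: str, k: int) -> List[str]:
    """Recommend the k globally most popular items the user has not interacted with."""
    # Count how many users interacted with each item (implicit feedback).
    counts = Counter(item for items in interactions.values() for item in items)
    seen = interactions.get(user, set())
    # Sort by descending popularity; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in seen][:k]


# Hypothetical sparse interaction data: each user holds only one or two products.
interactions = {
    "u1": {"car", "home"},
    "u2": {"car"},
    "u3": {"car", "travel"},
    "u4": {"home"},
}
recommendations = popularity_top_k(interactions, "u2", 2)
print(recommendations)
```

A baseline like this needs no training and no per-user history beyond the items to exclude, which may be one reason it holds up when interactions are sparse.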
JupySim: Jupyter Notebook Similarity Search System
Misato Horiuchi, Yuya Sasaki, Chuan Xiao, Makoto Onizuka
Computational notebooks such as Jupyter notebooks are popular for machine learning and data analytics tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for computational notebooks manually is a tedious task, and so far there are no tools to search for computational notebooks effectively and efficiently. In this paper, we develop JupySim, a system for similarity search on Jupyter notebooks. In JupySim, users specify contents (code, tabular data, libraries, and output formats) in Jupyter notebooks as a query, and then retrieve the top-k Jupyter notebooks whose contents are most similar to the given query. The distinguishing characteristic of JupySim is that queries and Jupyter notebooks are modeled as graphs to capture the relationships between code, data, and outputs. JupySim has an intuitive user interface with which users can easily specify the Jupyter notebooks they are looking for. Our demonstration scenarios show that JupySim is effective at finding Jupyter notebooks shared on Kaggle for data science.
DOI: https://doi.org/10.48786/edbt.2022.49. Pages: 2:554-2:557. Published: 2022-01-01.
Citations: 3
Unsupervised Selectivity Estimation by Integrating Gaussian Mixture Models and an Autoregressive Model
Zizhong Meng, Peizhi Wu, Gao Cong, Rong Zhu, Shuai Ma
Selectivity estimation is a fundamental database task, which has been studied for decades. A recent trend is to use deep learning methods for selectivity estimation. Deep autoregressive models have been reported to achieve excellent accuracy. However, if the relation has continuous attributes with large domain sizes, the search space of query inference on deep autoregressive models can be very large, resulting in inaccurate estimation and inefficient inference. To address this challenge, we propose a new model that integrates multiple Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models can fit the distribution of continuous attributes and reduce their domain sizes. On the other hand, the deep autoregressive model can learn the joint data distribution over the reduced-domain attributes. In experiments, we compare with multiple baselines on 4 real-world datasets containing continuous attributes, and the experimental results demonstrate that our model can achieve up to 20 times higher accuracy than the second-best estimator, while using less space and inference time.
DOI: https://doi.org/10.48786/edbt.2022.13. Pages: 2:247-2:259. Published: 2022-01-01.
Citations: 1
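The domain-reduction step described in the abstract above can be pictured with a small sketch: each continuous value is replaced by the index of its most likely mixture component, so an attribute with many distinct values collapses to a handful of codes. The mixture parameters below are hand-picked rather than fitted, and the abstract does not say how the paper couples these codes with the autoregressive model, so treat this only as an illustration of the idea.

```python
import math
from typing import List, Tuple

# A mixture component is (weight, mean, standard deviation).
Component = Tuple[float, float, float]


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def assign_component(x: float, mixture: List[Component]) -> int:
    """Index of the component with the highest weighted density at x."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in mixture]
    return max(range(len(mixture)), key=scores.__getitem__)


# Hypothetical two-component mixture over a continuous attribute.
mixture = [(0.5, 10.0, 2.0), (0.5, 100.0, 5.0)]
values = [8.7, 11.2, 97.5, 104.1]
codes = [assign_component(v, mixture) for v in values]
print(codes)
```

Here four distinct real values map to two discrete codes, which is the kind of smaller domain a downstream model could handle more cheaply.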
Towards A General SIMD Concurrent Approach to Accelerating Integer Compression Algorithms
Juliana Hildebrandt, Dirk Habich, Wolfgang Lehner
Integer compression algorithms play an important role in column-oriented data systems. Previous research has shown that the vectorized implementation of these algorithms based on the Single Instruction Multiple Data (SIMD) parallel paradigm can multiply both compression and decompression speeds. While a scalar compression algorithm usually compresses a block of N consecutive integers, the state-of-the-art SIMD implementation scales the block size to k · N, where k is the number of elements that can be processed simultaneously in a SIMD register. However, this means that as the SIMD register size increases, the block of integer values for compression also grows, which can have a negative effect on the compression ratio. In this paper, we analyze this effect and present an idea for a novel general approach to the SIMD implementation of integer compression algorithms that overcomes it. Our novel idea is to concurrently compress k different blocks of size N within SIMD registers. To show the applicability of our idea, we present initial evaluation results for a heavily used compression algorithm and show that our approach can lead to more responsible usage of main memory resources.
DOI: https://doi.org/10.48786/edbt.2022.32. Pages: 2:414-2:418. Published: 2022-01-01.
Citations: 1