
Proceedings of the International Workshop on Semantic Big Data: Latest Publications

SPARTI
Pub Date : 2018-06-10 DOI: 10.1145/3208352.3208356
Amgad Madkour, Walid G. Aref, Ahmed M. Aly
Semantic data is an integral component of search engines that provide answers beyond simple keyword-based matches. The Resource Description Framework (RDF) provides a standardized and flexible graph model for representing semantic data. The astronomical growth of RDF data raises the need for scalable RDF management strategies. Although cloud-based systems provide a rich platform for managing large-scale RDF data, the shared storage they rely on introduces several performance challenges, e.g., disk I/O and network shuffling overhead. This paper introduces SPARTI, a scalable RDF data management system. In SPARTI, data is partitioned according to the join patterns found in the query workload. SPARTI first vertically partitions the RDF data, then incrementally updates the partitioning according to the workload, which improves the query performance of frequent join patterns. SPARTI employs a partitioning schema, termed SemVP, that enables the system to read a reduced set of rows instead of entire partitions, and proposes a budgeting mechanism with a cost model to determine whether a repartitioning is worthwhile. On real and synthetic datasets, SPARTI is compared against a state-of-the-art Spark-based system and executes queries in around half the time across all query shapes while requiring around an order of magnitude less storage.
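The vertical partitioning that SPARTI starts from can be sketched in a few lines: each predicate gets its own two-column (subject, object) table, so a query touching one predicate reads one table instead of the whole dataset. This is a minimal illustration of the general technique, not the paper's implementation; the toy data and names are invented.

```python
# Minimal sketch of vertical partitioning for RDF triples: one
# (subject, object) table per predicate. Illustrative only.
from collections import defaultdict

def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one
    two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "acme"),
    ("bob", "knows", "carol"),
]
tables = vertical_partition(triples)
# a query that only touches :knows now reads a single small table
assert sorted(tables) == ["knows", "worksAt"]
assert tables["knows"] == [("alice", "bob"), ("bob", "carol")]
```

SPARTI's contribution is layered on top of this baseline: it merges or splits such tables incrementally as the workload's join patterns evolve.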
Citations: 1
Extending Apache Spark with a Mediation Layer
Pub Date : 2018-06-10 DOI: 10.1145/3208352.3208354
Dimitris Stripelis, Chrysovalantis Anastasiou, J. Ambite
With the recent growth of data volumes across many disciplines of both industry and academia, many new Big Data management systems have emerged to provide scalable tools for efficient data storage, processing, and analysis. However, most of these systems offer little support for efficiently integrating multiple external sources under a uniform schema and a single query access point, an integration that greatly simplifies further analytics. In this work, we present Spark Mediator, a system that extends the logical data integration capabilities of Apache Spark. As a use case, we apply Spark Mediator to the integration of schizophrenia neuroimaging data and compare it with previous data integration systems.
Citations: 3
Timestamp-based Integrity Proofs for Linked Data
Pub Date : 2018-06-10 DOI: 10.1145/3208352.3208353
Andrew Sutton, Reza Samavi
In this paper, we first investigate the state-of-the-art methods of generating cryptographic hashes that can be used as an integrity proof for RDF datasets. We then propose an efficient method of computing integrity proofs for Linked Data that constructs a sorted Merkle tree for growing RDF datasets based on timestamps (as a key) that are semantically extractable from the RDF dataset. We evaluate our method by comparing it to existing methods and investigating its resistance to common security threats.
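The core mechanism can be illustrated with a small sketch: hash each RDF statement, sort the leaves by a timestamp extracted from the data, and fold them into a single Merkle root that serves as the integrity proof. The hash choice, statement encoding, and odd-level padding rule below are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch of a Merkle root over timestamp-sorted RDF statements.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf hashes up to a single root hash."""
    if not leaves:
        return h(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:          # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# statements keyed by a timestamp semantically extracted from the RDF
statements = [
    ("2018-01-02", "<s2> <p> <o2> ."),
    ("2018-01-01", "<s1> <p> <o1> ."),
]
ordered = sorted(statements)        # sort by timestamp key
leaves = [h(stmt.encode()) for _, stmt in ordered]
root = merkle_root(leaves).hex()    # publish this root as the proof
assert len(root) == 64
```

Sorting by timestamp is what makes the tree cheap to extend as the dataset grows: new statements append at the end rather than forcing a full rebuild.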
Citations: 2
Stream WatDiv: A Streaming RDF Benchmark
Pub Date : 2018-06-10 DOI: 10.1145/3208352.3208355
Libo Gao, Lukasz Golab, M. Tamer Özsu, Günes Aluç
We present Stream WatDiv -- an open-source benchmark for streaming RDF data management systems. The proposed benchmark extends the existing WatDiv benchmark, and includes a streaming data generator, a query generator that can produce a diverse set of SPARQL queries, and a testbed to monitor correctness and latency. We use Stream WatDiv to evaluate two popular streaming RDF engines: C-SPARQL and CQUELS. With the diverse set of queries that can be generated by Stream WatDiv, we demonstrate new insights into the behaviour and performance of these systems.
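The latency half of such a testbed amounts to stamping each streamed triple on arrival and measuring the delay when a result that depends on it is emitted. The toy engine below is a stand-in (not C-SPARQL or CQELS, and not Stream WatDiv's actual harness), used only to show the bookkeeping.

```python
# Toy illustration of per-result latency measurement in a streaming
# testbed. The "engine" here just echoes triples back as results.
import time

def process(stream):
    """Fake engine: emits each triple as a result and records the
    delay between its arrival timestamp and emission time."""
    latencies = []
    for arrival, triple in stream:
        emitted = time.monotonic()
        latencies.append(emitted - arrival)
    return latencies

now = time.monotonic()
stream = [(now, ("s1", "p", "o1")), (now, ("s2", "p", "o2"))]
lat = process(stream)
assert len(lat) == 2 and all(x >= 0 for x in lat)
```

A real harness would additionally check result correctness against a reference answer set, the other property the Stream WatDiv testbed monitors.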
Citations: 10
TrueWeb
Pub Date : 2018-06-10 DOI: 10.1145/3208352.3208357
Amgad Madkour, Walid G. Aref, Sunil Prabhakar, Mohamed S. Ali, Siarhei Bykau
We envision a responsible web environment, termed TrueWeb, in which a user should be able to find out whether any sentence he or she encounters on the web is true or false, and to track the provenance of any sentence or paragraph. TrueWeb aims to compose factual knowledge from Internet resources about any subject of interest, present the collected knowledge in chronological order, situate facts spatially and temporally, and assign a belief factor to each fact. Another important target of TrueWeb is to identify whether a statement on the Internet is true or false. The aim is to create an Internet infrastructure that, for each piece of published information, can identify the truthfulness (or degree of truthfulness) of that information.
Citations: 2
Using semantic web technologies to power LungMAP, a molecular data repository
Pub Date : 2017-05-19 DOI: 10.1145/3066911.3066916
Michelle C. Krzyzanowski, Josh Levy, G. Page, N. Gaddis, R. Clark
As scientific research evolves, data continue to grow at an exponential rate. This growth calls for more repositories to store the data, and for additional centralized repositories that provide standards for researchers. Common data repositories allow for collaboration and easier sharing of data, critical for further advancing scientific understanding of a variety of topics. LungMAP (the Molecular Atlas of Lung Development) is an open-access reference resource that provides a comprehensive molecular atlas of the normal developing lung in humans and mice and supplies data and reagents to the research community. The database utilizes RDF, SPARQL, and OWL. LungMAP exemplifies the use of semantic web technologies to provide a collaborative, open-access data application for the scientific research community.
Citations: 2
Extracting linked data from statistic spreadsheets
Pub Date : 2017-05-19 DOI: 10.1145/3066911.3066914
Tien-Duc Cao, I. Manolescu, Xavier Tannier
Statistic data is an important sub-category of open data; it is interesting for many applications, including but not limited to data journalism, as such data is typically of high quality, and reflects (under an aggregated form) important aspects of a society's life such as births, immigration, economy etc. However, such open data is often not published as Linked Open Data (LOD) limiting its usability. We provide a conceptual model for the open data comprised in statistics published by INSEE, the national French economic and societal statistics institute. Then, we describe a novel method for extracting RDF LOD, to populate an instance of this model. We used our method to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction.
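The kind of table-to-RDF lifting the paper automates can be sketched simply: each data cell becomes a triple linking a row entity, a column concept, and a literal value. The URIs, vocabulary, and data below are invented for illustration and are much cruder than the paper's conceptual model for INSEE statistics.

```python
# Sketch of lifting one spreadsheet row into N-Triples lines.
# All URIs and column names are hypothetical.
def row_to_ntriples(row_uri, header, cells, base="http://example.org/"):
    """Emit one triple per (column, value) cell of a row."""
    triples = []
    for col, val in zip(header, cells):
        triples.append(f'<{row_uri}> <{base}{col}> "{val}" .')
    return triples

header = ["region", "births2016"]
out = row_to_ntriples("http://example.org/row1", header,
                      ["Bretagne", "35000"])
assert out[0] == '<http://example.org/row1> <http://example.org/region> "Bretagne" .'
```

The hard part the paper addresses lies upstream of this step: locating the header row, untangling merged cells, and mapping column labels to the right statistical concepts, which is where the reported 91% success rate is measured.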
Citations: 17
On data placement strategies in distributed RDF stores
Pub Date : 2017-05-19 DOI: 10.1145/3066911.3066915
Daniel Janke, Steffen Staab, Matthias Thimm
In recent years, scalable RDF stores in the cloud have been developed, in which graph data is distributed over compute and storage nodes to scale query processing and memory needs. One main challenge in these RDF stores is the data placement strategy, which can be formalized in terms of graph covers. These graph covers determine whether (a) different query results may be computed on several compute nodes in parallel (vertical parallelization) and (b) individual query results can be produced only from triples assigned to few --- ideally one --- storage node (horizontal containment). We analyse the impact of the three most commonly used graph cover strategies in these terms and find that balancing the query workload reduces query execution time more than reducing data transfer over the network does. To this end, we present our novel benchmark and open-source evaluation platform.
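One of the simplest graph cover strategies in this design space is a subject-hash cover: every triple is placed on the node that owns its subject, so star-shaped queries on one subject stay on one node. The sketch below is a generic illustration of that strategy under invented data, not the paper's benchmark code.

```python
# Minimal sketch of a subject-hash graph cover: each triple goes to
# the node owning its subject. Node count and triples are toy values.
def subject_hash_cover(triples, n_nodes):
    """Assign each (s, p, o) triple to node hash(s) % n_nodes."""
    cover = {i: [] for i in range(n_nodes)}
    for s, p, o in triples:
        cover[hash(s) % n_nodes].append((s, p, o))
    return cover

triples = [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "d")]
cover = subject_hash_cover(triples, 2)
# horizontal containment for star queries: every triple with subject
# "a" lands on the same node
node_of_a = hash("a") % 2
assert all(t in cover[node_of_a] for t in triples if t[0] == "a")
```

The paper's finding is that how evenly such a cover spreads the *query workload* across nodes matters more for execution time than how much network transfer it avoids.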
Citations: 7
Evolution of anatomical concept usage over time: mining 200 years of biodiversity literature
Pub Date : 2017-05-19 DOI: 10.1145/3066911.3066919
Prashanti Manda, T. Vision
The scientific literature contains an historic record of the changing ways in which we describe the world. Shifts in understanding of scientific concepts are reflected in the introduction of new terms and the changing usage and context of existing ones. We conducted an ontology-based temporal data mining analysis of biodiversity literature from the 1700s to 2000s to quantitatively measure how the context of usage for vertebrate anatomical concepts has changed over time. The corpus of literature was divided into nine non-overlapping time periods with comparable amounts of data and context vectors of anatomical concepts were compared to measure the magnitude of concept drift both between adjacent time periods and cumulatively relative to the initial state. Surprisingly, we found that while anatomical concept drift between adjacent time periods was substantial (55% to 68%), it was of the same magnitude as cumulative concept drift across multiple time periods. Such a process, bound by an overall mean drift, fits the expectations of a mean-reverting process.
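Comparing a concept's context vectors between two periods is, in spirit, a vector-similarity computation: drift can be scored as one minus the cosine similarity of the two co-occurrence profiles. The sketch below illustrates that idea with invented counts; the paper's actual drift measure and corpus processing are more involved.

```python
# Hedged sketch: concept drift as 1 - cosine similarity between
# context (co-occurrence) vectors from two time periods. Toy data.
import math

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# invented co-occurrence counts for one anatomical concept
period1 = {"bone": 4, "hindlimb": 2, "length": 1}
period2 = {"bone": 3, "hindlimb": 1, "ossification": 2}
drift = 1 - cosine(period1, period2)
assert 0 <= drift <= 1
```

Applied between adjacent periods and cumulatively against the earliest period, this is the kind of comparison behind the paper's 55%-68% adjacent-period drift figures.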
Citations: 0
A distributed graph approach for pre-processing linked RDF data using supercomputers
Pub Date : 2017-05-19 DOI: 10.1145/3066911.3066913
M. Lewis, G. Thiruvathukal, V. Vishwanath, M. Papka, Andrew E. Johnson
Efficient RDF graph-based queries are becoming more pertinent given the increased interest in data analytics and its intersection with large, unstructured but connected data. Many commercial systems have adopted distributed RDF graph systems in order to handle increasing dataset sizes and complex queries. This paper introduces a distributed graph approach to pre-processing linked data. Instead of traversing the in-memory graph, our system indexes pre-processed join elements that are organized in a graph structure. We analyze the DBpedia dataset (derived from the Wikipedia corpus) and compare our access method to the graph traversal access approach that we also devise. Our experiments show that the distributed, pre-processed graph approach to accessing linked data is faster than the traversal approach over a specific range of linked queries.
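The contrast between traversal and a pre-processed join index can be shown in miniature: materialize which triples share a join key ahead of time, and a two-hop linked query becomes a dictionary lookup instead of a graph walk. This is a generic illustration of the idea, not the paper's supercomputer implementation.

```python
# Sketch of a pre-processed join index for object-subject joins.
# Data structures and data are illustrative only.
from collections import defaultdict

def build_join_index(triples):
    """Index triples by object so object-subject joins are lookups."""
    by_object = defaultdict(list)
    for t in triples:
        by_object[t[2]].append(t)
    return by_object

def join_on_path(triples, index):
    """Find (s,p,x)-(x,q,o) pairs: a triple's object feeds another's
    subject, via the prebuilt index instead of traversal."""
    return [(t1, t2) for t2 in triples for t1 in index.get(t2[0], [])]

triples = [("a", "p", "b"), ("b", "q", "c")]
idx = build_join_index(triples)
pairs = join_on_path(triples, idx)
assert pairs == [(("a", "p", "b"), ("b", "q", "c"))]
```

Distributing such an index across nodes is what lets the approach answer linked queries without the memory-graph traversal the paper compares against.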
Citations: 1