Towards a model for replicating aesthetic literary appreciation. T. Crosbie, Timothy French, M. Conrad. SWIM '13. doi:10.1145/2484712.2484720
This study aims to bridge the gap between subjective literary criticism and natural language processing by creating a model that emulates the results of a survey of literary tastes. A panel of human experts assessed segments of literary text according to how aesthetically pleasing they found them. These segments were then rated for literariness in an open survey using a Likert scale. Each segment was processed with a part-of-speech tagger using NLTK, and the results were compared with those of the survey. Following a Grounded Theory approach, experiments with various combinations of parts of speech were carried out to build a model that could replicate the results of the open survey. The success of this approach confirms the feasibility of using the method to create a more accurate and analytical model of literary criticism involving deeper stylistic markers.
Social infobuttons: integrating open health data with social data using semantic technology. Xiang Ji, Soon Ae Chun, J. Geller. SWIM '13. doi:10.1145/2484712.2484718
There is a large amount of free health information available for a patient to address her health concerns. HealthData.gov includes readily downloadable community health datasets at the national, state and community levels. There are also patient-generated datasets, accessible through social media, on the conditions, treatments and side effects that individual patients experience. While caring for patients, clinicians and other healthcare providers may benefit from the integrated information and knowledge embedded in these open health datasets, such as national health trends and social health trends drawn from patient-generated healthcare experiences. However, the open health datasets are distributed and range from structured to highly unstructured. An information seeker has to spend time visiting many, possibly irrelevant, websites, and has to select relevant information from each and integrate it into a coherent mental model. In this paper, we present a Linked Data approach to integrating these health data sources and presenting contextually relevant information, called Social InfoButtons, to healthcare professionals and patients. We present methods for data extraction, semantic linked data integration and visualization. A Social InfoButtons prototype system provides awareness of community and patient health issues and healthcare trends that may shed light on patient care and health policy decisions.
Scalable reconstruction of RDF-archived relational databases. S. Stefanova, T. Risch. SWIM '13. doi:10.1145/2484712.2484717
We have investigated approaches for the scalable reconstruction of relational databases (RDBs) archived as RDF files. An archived RDB is reconstructed from a data archive file and a schema archive file, both in N-Triples format. The archives contain RDF triples representing the archived relational data content and the relational schema describing that content, respectively. When an archived RDB is to be reconstructed, the schema archive is first read to automatically create the RDB schema using a schema reconstruction algorithm that identifies RDB elements by queries to the schema archive. The RDB thus created is then populated by reading the data archive. To populate the RDB we have developed two approaches, the naive Insert Attribute Value (IAV) approach and the Triple Bulk Load (TBL) approach. With the IAV approach, the database is populated by stored procedures that execute SQL INSERT or UPDATE statements to insert attribute values into the RDB tables. In the more complex TBL approach, the database is populated by bulk loading CSV files generated by sorting the data archive triples joined with schema information. Our experiments show that the TBL approach is substantially faster than the IAV approach.
Overcoming limitations of term-based partitioning for distributed RDFS reasoning. Tugba Kulahcioglu, Hasan Bulut. SWIM '13. doi:10.1145/2484712.2484719
RDFS reasoning is carried out via the shared terms of triples; accordingly, a distributed reasoning approach should bring together triples that have terms in common. To achieve this, term-based partitioning distributes triples to partitions based on the terms they include. However, the skewed distribution of Semantic Web data results in unbalanced load distribution. A single peer must be able to handle even the largest partition, and this requirement limits scalability. The approach also suffers from data replication, since a triple is sent to multiple partitions. In this paper, we propose a two-step method to overcome these limitations. Our RDFS-specific term-based partitioning algorithm applies a selective distribution policy and distributes triples with minimal replication. Our schema-sensitive processing approach eliminates non-productive partitions and enables processing of a partition regardless of its size. The resulting partitions reach full closure without replicating the global schema and without the fix-point iteration suggested by previous studies.
Scalable containment for unions of conjunctive queries under constraints. G. Konstantinidis, J. Ambite. SWIM '13. doi:10.1145/2484712.2484716
We consider the problem of query containment under ontological constraints, such as those of RDFS. Query containment, i.e., deciding whether the answers of a given query are always contained in the answers of another query, is an important problem in areas such as database theory and knowledge representation, with applications to data integration, query optimization and minimization. We consider unions of conjunctive queries, which constitute the core of structured query languages such as SPARQL and SQL. We also consider ontological constraints, or axioms, expressed as Tuple-Generating Dependencies (TGDs). TGDs capture RDF/S and fragments of Description Logics. We consider classes of TGDs for which the chase is known to terminate. Query containment under chase-terminating axioms can be decided by first running the chase on one of the two queries and then relying on classic relational containment. When considering unions of conjunctive queries, classic algorithms for both the chase and containment phases suffer from a large degree of redundancy. We leverage a graph-based modeling of rules that represents multiple queries in a compact form by exploiting shared patterns among them. As a result, we couple the chase and containment phases and obtain a faster and more scalable algorithm. Our experiments show a speedup of close to two orders of magnitude.
Semantic description of OData services. M. Kirchhoff, K. Geihs. SWIM '13. doi:10.1145/2484712.2484714
The Open Data Protocol (OData) is a data access protocol based on REST principles. It is built upon existing and well-known technologies such as HTTP, AtomPub and JSON. OData is already widely used in industry, and many IT companies provide OData interfaces for their applications. The structure of the data provided by an OData service is described with the Conceptual Schema Definition Language (CSDL). To make this data available for integration with the Semantic Web, we propose semantically annotating CSDL documents. This extension of CSDL allows the definition of mappings from the underlying Entity Data Model (EDM) to RDF graphs, which is a first step towards implementing a SPARQL endpoint on top of existing OData services. Based on the OData interfaces of existing enterprise resource planning (ERP) systems, it is possible to realize a SPARQL endpoint for those systems, which can greatly simplify data retrieval.
Large-scale bisimulation of RDF graphs. A. Schätzle, Antony Neu, G. Lausen, Martin Przyjaciel-Zablocki. SWIM '13. doi:10.1145/2484712.2484713
RDF datasets with billions of triples are no longer unusual and continue to grow constantly (e.g. the LOD cloud), driven by the inherent flexibility of RDF, which makes it possible to represent very diverse datasets, ranging from highly structured to unstructured data. Because of their size, understanding and processing RDF graphs is often difficult, and methods that reduce their size while preserving as much of their structural information as possible become attractive. In this paper we study bisimulation as a means to reduce the size of RDF graphs according to structural equivalence. We study two bisimulation algorithms, one for sequential execution using SQL and one for distributed execution using MapReduce. We demonstrate that the MapReduce-based implementation scales linearly with the number of RDF triples, allowing the bisimulation of very large RDF graphs to be computed in a time that is far out of reach for the sequential version. Experiments on synthetic benchmark data and real data (DBpedia) show a reduction of more than 90% in graph size, measured by comparing the number of nodes to the number of blocks in the resulting bisimulation partition.
LOP: capturing and linking open provenance on LOD cycle. Rogers Reiche de Mendonça, Sérgio Manuel Serra da Cruz, Jonas F. S. M. de La Cerda, M. C. Cavalcanti, K. F. Cordeiro, M. Campos. SWIM '13. doi:10.1145/2484712.2484715
The Web of Data has emerged as a means to expose, share, reuse and connect information on the Web, identified by URIs and using RDF as a data model, following the Linked Data principles. However, the reuse of third-party data can be compromised without proper data quality assessments. In this context, important questions emerge: how can one trust published data and links? Which manipulation, modification and integration operations were applied to the data before its publication? What is the nature of the comparisons or transformations applied to the data during the interlinking process? In this scenario, provenance becomes a fundamental element. In this paper, we describe an approach for generating and capturing Linked Open Provenance (LOP) to support data quality and trustworthiness assessments, covering the preparation and format transformation of traditional data sources through to dataset publication and interlinking. The proposed architecture takes advantage of provenance agents, orchestrated by an ETL workflow approach, to collect provenance at any specified level and to link it with its corresponding data. We also describe a real use case in which the architecture was implemented to evaluate the proposal.