In this paper we describe the process of building indices for semantic search using MapReduce. We compare the two most straightforward representations of RDF data, the horizontal index structure using parallel indices and the vertical index structure using fields. We measure the cost of building indices and also compare retrieval performance on keyword queries and queries restricted to particular properties.
{"title":"Distributed indexing for semantic search","authors":"P. Mika","doi":"10.1145/1863879.1863882","DOIUrl":"https://doi.org/10.1145/1863879.1863882","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"29 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"134021269"}
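The two index layouts compared in the abstract above can be illustrated with a toy, in-memory map/shuffle/reduce pipeline. This is only a sketch under stated assumptions: "horizontal" is taken to mean two position-aligned posting structures (one for tokens, one for properties) and "vertical" one field per property. The triples, function names, and the single-process shuffle are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy triples: (subject, property, literal value).
TRIPLES = [
    ("ex:mika", "foaf:name", "peter mika"),
    ("ex:mika", "ex:affiliation", "yahoo research"),
    ("ex:sack", "foaf:name", "harald sack"),
]

def map_triples(triples):
    """Map step: key each (property, value) pair by its subject."""
    for subject, prop, value in triples:
        yield subject, (prop, value)

def shuffle(pairs):
    """Group mapped pairs by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_horizontal(grouped):
    """Assumed horizontal layout: token and property postings share positions."""
    token_index = defaultdict(list)
    property_index = defaultdict(list)
    for subject, values in grouped.items():
        position = 0
        for prop, value in values:
            for token in value.split():
                token_index[token].append((subject, position))
                property_index[prop].append((subject, position))
                position += 1
    return token_index, property_index

def reduce_vertical(grouped):
    """Assumed vertical layout: postings keyed by (property field, token)."""
    field_index = defaultdict(list)
    for subject, values in grouped.items():
        for prop, value in values:
            for token in value.split():
                field_index[(prop, token)].append(subject)
    return field_index

def search_horizontal(token_index, property_index, token, prop):
    """Restrict a keyword to a property by intersecting aligned positions."""
    in_property = set(property_index.get(prop, []))
    hits = {s for s, pos in token_index.get(token, []) if (s, pos) in in_property}
    return sorted(hits)

grouped = shuffle(map_triples(TRIPLES))
token_index, property_index = reduce_horizontal(grouped)
field_index = reduce_vertical(grouped)
```

In this sketch a property-restricted query costs a position intersection in the horizontal layout but a single lookup in the vertical one, which is the kind of trade-off the abstract says the paper measures.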
J. Waitelonis, Harald Sack, Johannes Hercher, Zalan Kramer
With the exponential growth of video data on the World Wide Web comes the challenge of finding efficient methods for video content management, content-based video search, filtering and browsing. However, video data often lacks sufficient meta-data to open up the video content and to enable pinpoint content-based search. With the advent of the 'web of data' as an extension of the current WWW, new data sources can be exploited by semantically interconnecting video meta-data with the web of data. This enables better access to video repositories by deploying semantic search technologies and improves the user's search experience by supporting exploratory search strategies. We have developed the prototype semantic video search engine 'yovisto', which demonstrates the advantages of semantically enhanced exploratory video search and enables investigative navigation and browsing in large video repositories.
{"title":"Semantically enabled exploratory video search","authors":"J. Waitelonis, Harald Sack, Johannes Hercher, Zalan Kramer","doi":"10.1145/1863879.1863887","DOIUrl":"https://doi.org/10.1145/1863879.1863887","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"19 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133985491"}
T. Imielinski, Jinyun Yan, Yihan Fang, Kurt Eldridge, Huiwen Yu, Peter Kelly
Paraphrasing is the restatement (or reuse) of text which preserves its meaning in another form. A para-query is a paraphrase of a search query. Humans easily recognize para-queries, but search engines are still far from doing so. We claim that, for a search engine to be called semantic, it must recognize para-queries by returning the same search results for all para-queries of a given query. Recognizing para-queries is an important and desirable ability of a search engine. It can relieve users of the burden of rephrasing queries in order to improve the relevance of results. In this paper, we cover two main threads: monolingual para-query generation (PG) and para-query recognition measurement (PRM). Para-query generation aims to automatically generate as many English para-queries as possible for a given query. We propose a novel game, "Rephraser", to tackle this problem. Hundreds of para-query templates are extracted from the game's output and used to compose tens of thousands of para-queries. The goal of para-query recognition measurement is to examine the extent to which search engines recognize para-queries. We propose the paraphrasing invariance coefficient (PIC), defined as the probability that search results are the same for a pair of para-queries. Using para-queries generated from the game, we design experiments to measure search engines' PIC. Results show that today's leading search engines are still inferior to humans in recognizing para-queries. Search still has a long way to go before it is truly semantic.
{"title":"Paraphrasing invariance coefficient: measuring para-query invariance of search engines","authors":"T. Imielinski, Jinyun Yan, Yihan Fang, Kurt Eldridge, Huiwen Yu, Peter Kelly","doi":"10.1145/1863879.1863880","DOIUrl":"https://doi.org/10.1145/1863879.1863880","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"5 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"123599784"}
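The PIC definition in the abstract above (the probability that search results are the same for a pair of para-queries) can be estimated empirically as a fraction of matching pairs. The sketch below is illustrative only: treating "the same" as equality of top-k result sets is an assumption, and the `search` callable and the toy engine are hypothetical stand-ins, not the authors' experimental setup.

```python
from itertools import combinations

def same_results(results_a, results_b, k=10):
    """Assumed criterion: two result lists match if their top-k sets coincide."""
    return set(results_a[:k]) == set(results_b[:k])

def paraphrasing_invariance(para_query_groups, search, k=10):
    """Estimate PIC as the fraction of para-query pairs with matching results.

    para_query_groups: list of lists; each inner list holds paraphrases
    of one underlying query. search: callable mapping a query string to
    a ranked list of result identifiers.
    """
    matching = 0
    total = 0
    for group in para_query_groups:
        results = {query: search(query) for query in group}
        for q1, q2 in combinations(group, 2):
            total += 1
            if same_results(results[q1], results[q2], k):
                matching += 1
    return matching / total if total else float("nan")

# Hypothetical engine: a lookup table standing in for a real search API.
FAKE_ENGINE = {
    "capital of france": ["d1", "d2"],
    "france capital": ["d2", "d1"],
    "what is the capital of france": ["d3", "d1"],
}
pic = paraphrasing_invariance([list(FAKE_ENGINE)], FAKE_ENGINE.get)
```

Here one of the three pairs shares a result set, so the estimate is 1/3. A stricter criterion (identical ranking) or a looser one (overlap above a threshold) would give a different coefficient.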
In this paper, we describe a system that automatically extracts quotations from news feeds, and allows efficient retrieval of the semantically annotated quotes. APIs for real-time querying of over 10 million quotes extracted from recent news feeds are publicly available. In addition, each day we add around 60 thousand new quotes extracted from around 50 thousand news articles or blogs. We apply computational linguistic techniques such as coreference resolution, entity recognition and disambiguation to improve both precision and recall of the quote detection. We support faceted search on both speakers and entities mentioned in the quotes.
{"title":"A large-scale system for annotating and querying quotations in news feeds","authors":"Jisheng Liang, Navdeep Dhillon, K. Koperski","doi":"10.1145/1863879.1863886","DOIUrl":"https://doi.org/10.1145/1863879.1863886","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"8 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129380080"}
We consider the task of entity search and examine to what extent state-of-the-art information retrieval (IR) and semantic web (SW) technologies are capable of answering information needs that focus on entities. We also explore the potential of combining IR with SW technologies to improve the end-to-end performance on a specific entity search task. We arrive at and motivate a proposal to combine text-based entity models with semantic information from the Linked Open Data cloud.
{"title":"Entity search: building bridges between two worlds","authors":"K. Balog, E. Meij, M. de Rijke","doi":"10.1145/1863879.1863888","DOIUrl":"https://doi.org/10.1145/1863879.1863888","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"34 2 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133069499"}
Search engines have become the main entry point to Web content, and a large part of the "visible" Web consists of what they present as top retrieved results. Therefore, it would be desirable if the first few results were a representative sample of the entire result set. This paper provides a preliminary study of the opinions contained in search engine results for controversial queries such as "cloning" or "immigration". To this end, we extract sentiment metadata from web pages and compare search engine results for several queries. Furthermore, we compare opinions expressed in the top results to those in other retrieved results to examine whether the top-ranked pages are a good sample of all results from an opinion perspective. In a preliminary empirical analysis, we compare up to 50 results from 3 commercial search engines on 14 controversial queries to study the relation between sentiments, topics, and rankings.
{"title":"Dear search engine: what's your opinion about...?: sentiment analysis for semantic enrichment of web search results","authors":"Gianluca Demartini, Stefan Siersdorfer","doi":"10.1145/1863879.1863883","DOIUrl":"https://doi.org/10.1145/1863879.1863883","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"40 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"116622888"}
S. Wrigley, D. Reinhard, Khadija Elbedweihy, A. Bernstein, F. Ciravegna
The main problem with the state of the art in the semantic search domain is the lack of comprehensive evaluations. There exist only a few efforts to evaluate semantic search tools and to compare the results with other evaluations of their kind. In this paper, we present a systematic approach for testing and benchmarking semantic search tools that was developed within the SEALS project. Unlike other semantic web evaluations, our methodology tests search tools both automatically and interactively with a human user in the loop. This allows us to test not only functional performance measures, such as precision and recall, but also usability issues, such as ease of use and comprehensibility of the query language. The paper describes the evaluation goals and assumptions; the criteria and metrics; and the types of experiments we will conduct, as well as the datasets required to conduct the evaluation in the context of the SEALS initiative. To our knowledge, it is the first effort to present a comprehensive evaluation methodology for Semantic Web search tools.
{"title":"Methodology and campaign design for the evaluation of semantic search tools","authors":"S. Wrigley, D. Reinhard, Khadija Elbedweihy, A. Bernstein, F. Ciravegna","doi":"10.1145/1863879.1863889","DOIUrl":"https://doi.org/10.1145/1863879.1863889","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"36 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129868292"}
José R. Pérez-Agüera, Javier Arroyo, J. Greenberg, Joaquín Pérez-Iglesias, Víctor Fresno-Fernández
Information Retrieval (IR) approaches for semantic web search engines have become very popular in recent years. The popularization of IR libraries such as Lucene, which allow IR implementations almost out of the box, has made IR integration in Semantic Web search engines easier. However, one of the most important features of Semantic Web documents is their structure, since this structure allows us to represent semantics in a machine-readable format. In this paper we analyze the specific problems of structured IR and how to adapt weighting schemas for semantic document retrieval.
{"title":"Using BM25F for semantic search","authors":"José R. Pérez-Agüera, Javier Arroyo, J. Greenberg, Joaquín Pérez-Iglesias, Víctor Fresno-Fernández","doi":"10.1145/1863879.1863881","DOIUrl":"https://doi.org/10.1145/1863879.1863881","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"3 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"114899495"}
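The abstract above does not spell out the weighting schema, so the following is only a sketch of one common textbook formulation of BM25F, the scheme named in the title: per-field term frequencies are length-normalized, weighted, summed into a pseudo-frequency, and then saturated. The field names, weights, `b` values, and the non-negative IDF variant are all assumptions chosen for illustration, not the paper's settings.

```python
import math

def bm25f_score(query_terms, doc_fields, avg_field_len, doc_freq, num_docs,
                field_weight, field_b, k1=1.2):
    """Score one document with a common BM25F formulation.

    doc_fields: field name -> list of tokens in that field.
    avg_field_len: field name -> average field length in the collection.
    doc_freq: term -> number of documents containing the term.
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            b = field_b[field]
            # Per-field length normalization, applied before combining fields.
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weight[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        n = doc_freq.get(term, 0)
        # Non-negative IDF variant (an assumption; other variants exist).
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Saturation is applied once, to the combined pseudo-frequency.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score

# Hypothetical two-field documents and parameters.
doc_a = {"title": ["semantic", "search"], "body": ["index", "structures"]}
doc_b = {"title": ["index", "structures"], "body": ["semantic", "search"]}
avg_len = {"title": 2.0, "body": 2.0}
weights = {"title": 3.0, "body": 1.0}
b_values = {"title": 0.75, "body": 0.75}
doc_freq = {"semantic": 2}

score_a = bm25f_score(["semantic"], doc_a, avg_len, doc_freq, 10, weights, b_values)
score_b = bm25f_score(["semantic"], doc_b, avg_len, doc_freq, 10, weights, b_values)
```

With these toy weights the title match in `doc_a` outscores the body match in `doc_b`, which shows how field structure can influence ranking.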
Although one might argue that little wisdom can be conveyed in messages of 140 characters or less, this paper sets out to explore whether the aggregation of messages in social awareness streams, such as Twitter, conveys meaningful information about a given domain. As a research community, we know little about the structural and semantic properties of such streams, and how they can be analyzed, characterized and used. This paper introduces a network-theoretic model of social awareness streams, a so-called "tweetonomy", together with a set of stream-based measures that allow researchers to systematically define and compare different stream aggregations. We apply the model and measures to a dataset acquired from Twitter to study emerging semantics in selected streams. The network-theoretic model and the corresponding measures introduced in this paper are relevant for researchers interested in information retrieval and ontology learning from social awareness streams. Our empirical findings demonstrate that different social awareness stream aggregations exhibit interesting differences, making them amenable to different applications.
{"title":"The wisdom in tweetonomies: acquiring latent conceptual structures from social awareness streams","authors":"Claudia Wagner, M. Strohmaier","doi":"10.1145/1863879.1863885","DOIUrl":"https://doi.org/10.1145/1863879.1863885","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"52 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"121484510"}
We have been developing a task-based service navigation system that offers the user services relevant to the task the user wants to perform. The system allows the user to concretize his/her request in a task model developed by human experts. In this study, to reduce the cost of collecting a wide variety of activities, we investigate the automatic modeling of users' real-world activities from the web. To extract the widest possible variety of activities with high precision and recall, we investigate the appropriate number of contents and resources to extract. Our results show that we do not need to examine the entire web, which is too time consuming; only a limited number of search results (e.g. 900 from among 21,000,000 search results) from blog contents is needed. In addition, to estimate the hierarchical relationships present in the activity model with the lowest possible error rate, we propose a method that divides the representation of activities into a noun part and a verb part, and calculates the mutual information between them. The results show that almost 80% of the hierarchical relationships can be captured by the proposed method.
{"title":"Automatic modeling of user's real world activities from the web for semantic IR","authors":"Yusuke Fukazawa, J. Ota","doi":"10.1145/1863879.1863884","DOIUrl":"https://doi.org/10.1145/1863879.1863884","PeriodicalId":239913,"journal":{"name":"SEMSEARCH '10","volume":"19 1","pages":"0"},"publicationDate":"2010-04-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"131264098"}
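The abstract above mentions splitting each activity into a noun part and a verb part and computing the mutual information between them. One plausible reading is pointwise mutual information (PMI) over co-occurrence counts, sketched below. This reading is an assumption, the sample activities are invented, and how the scores are turned into hierarchical relationships is not stated in the abstract and is not modeled here.

```python
import math
from collections import Counter

def pmi_table(activities):
    """Compute PMI (in bits) for each observed (noun, verb) pair.

    activities: list of (noun, verb) tuples, one per extracted activity.
    PMI(n, v) = log2( P(n, v) / (P(n) * P(v)) ), estimated from counts.
    """
    pair_counts = Counter(activities)
    noun_counts = Counter(noun for noun, _ in activities)
    verb_counts = Counter(verb for _, verb in activities)
    total = len(activities)
    table = {}
    for (noun, verb), count in pair_counts.items():
        p_pair = count / total
        p_noun = noun_counts[noun] / total
        p_verb = verb_counts[verb] / total
        table[(noun, verb)] = math.log2(p_pair / (p_noun * p_verb))
    return table

# Hypothetical activities of the form (noun part, verb part).
ACTIVITIES = [
    ("movie", "watch"),
    ("movie", "watch"),
    ("movie", "rent"),
    ("car", "rent"),
]
table = pmi_table(ACTIVITIES)
```

In this toy data, ("car", "rent") gets a positive score because "car" only ever appears with "rent", while ("movie", "rent") scores below zero because "movie" mostly appears with "watch".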