We survey the INEX initiative, which focuses on the evaluation of content-based access to XML documents. First, we describe the test setting and the various tracks of INEX. Then we present a new framework for the different views on XML retrieval, in which we distinguish between the structural and the content dimension; within this space, we locate current activities and point out new areas of research. Finally, we discuss the combination of semantic web technologies and XML retrieval, pointing out potential benefits as well as the need for further research in this area.
{"title":"Advances in XML retrieval: the INEX initiative","authors":"N. Fuhr, M. Lalmas","doi":"10.1145/1364742.1364763","DOIUrl":"https://doi.org/10.1145/1364742.1364763","url":null,"abstract":"We give a survey over the INEX initiative, which focuses on the evaluation of content -based access to XML documents. First, we describe the test setting and the various tracks of INEX. Then we present a new framework for the different views on XML retrieval, where we distinguish between the structural and the content dimension; in this space, current activities are located as well as new areas of research are pointed out. Finally, we discuss the combination of semantic web technologies and XML retrieval, pointing out potential benefits as well as the need for further research in this area.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123851268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Search engines -- "web dragons" -- are the portals through which we access society's treasure trove of information. They do not publish the algorithms they use to sort and filter information, yet how they work is one of the most important questions of our time. Google's PageRank is a way of measuring the prestige of each web page in terms of who links to it: it reflects the experience of a surfer condemned to click randomly around the web forever. The HITS technique distinguishes "hubs" that point to reputable sources from "authorities," the sources themselves. This helps differentiate communities on the web, which in turn can tease out alternative interpretations of ambiguous query terms. RankNet uses machine learning techniques to rank documents by predicting relevance judgments based on training data. This article explains in non-technical terms how the dragons work.
{"title":"How the dragons work: searching in a web","authors":"I. Witten","doi":"10.1145/1364742.1364747","DOIUrl":"https://doi.org/10.1145/1364742.1364747","url":null,"abstract":"Search engines -- \"web dragons\" -- are the portals through which we access society's treasure trove of information. They do not publish the algorithms they use to sort and filter information, yet how they work is one of the most important questions of our time. Google's PageRank is a way of measuring the prestige of each web page in terms of who links to it: it reflects the experience of a surfer condemned to click randomly around the web forever. The HITS technique distinguishes \"hubs\" that point to reputable sources from \"authorities,\" the sources themselves. This helps differentiate communities on the web, which in turn can tease out alternative interpretations of ambiguous query terms. RankNet uses machine learning techniques to rank documents by predicting relevance judgments based on training data. This article explains in non-technical terms how the dragons work.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122035366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
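The random-surfer reading of PageRank mentioned in the abstract above can be illustrated with a short power iteration. This is a minimal sketch, not the article's or Google's implementation: the toy link graph, the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to (hypothetical toy input).
    Returns a dict mapping each page to its rank; the ranks sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a uniformly random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the surfer follows one of the page's out-links at random.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links: the surfer jumps to a random page.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Toy web: "c" is linked to by three pages, "d" by none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph the page with the most incoming links ends up with the highest rank, and the page nobody links to keeps only the share it receives from random jumps.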
Digital Libraries have many forms -- institutional libraries for information dissemination, document repositories for record-keeping, and personal digital libraries for organizing personal thoughts, knowledge, and course of action. Digital image content (scanned or otherwise) is a substantial component of all of these libraries. Processing and analyzing these images includes tasks such as document layout understanding, character recognition, functional role labeling, image enhancement, indexing, organizing, restructuring, summarizing, cross linking, redaction, privacy management, and distribution. At the Palo Alto Research Center, we conduct research on several aspects of document analysis for Digital Libraries ranging from raw image transformations to linguistic analysis to interactive sensemaking tools. I shall describe a few recent research activities in the realm of document image analysis and their use in digital libraries.
{"title":"Document image analysis for digital libraries","authors":"Prateek Sarkar","doi":"10.1145/1364742.1364758","DOIUrl":"https://doi.org/10.1145/1364742.1364758","url":null,"abstract":"Digital Libraries have many forms -- institutional libraries for information dissemination, document repositories for record-keeping, and personal digital libraries for organizing personal thoughts, knowledge, and course of action. Digital image content (scanned or otherwise) is a substantial component of all of these libraries. Processing and analyzing these images include tasks such as document layout understanding, character recognition, functional role labeling, image enhancement, indexing, organizing, restructuring, summarizing, cross linking, redaction, privacy management, and distribution. At the Palo Alto Research Center, we conduct research on several aspects of document analysis for Digital Libraries ranging from raw image transformations to linguistic analysis to interactive sensemaking tools. I shall describe a few recent research activities in the realm of document image analysis or their use in digital libraries.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127394328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of fuzzy information systems has grown and is maturing. In this paper, some applications of fuzzy set theory to information retrieval are described, as well as the more recent outcomes of research in this field. Fuzzy set theory is applied to information retrieval with the main aim of defining flexible systems, i.e., systems that can represent and manage the vagueness and subjectivity that characterize the process of information representation and retrieval, one of the main objectives of artificial intelligence.
{"title":"Vagueness and uncertainty in information retrieval: how can fuzzy sets help?","authors":"D. Kraft, G. Pasi, Gloria Bordogna","doi":"10.1145/1364742.1364746","DOIUrl":"https://doi.org/10.1145/1364742.1364746","url":null,"abstract":"The field of fuzzy information systems has grown and is maturing. In this paper, some applications of fuzzy set theory to information retrieval are described, as well as the more recent outcomes of research in this field. Fuzzy set theory is applied to information retrieval with the main aim being to define flexible systems, i.e., systems that can represent and manage the vagueness and subjectivity which characterizes the process of information representation and retrieval, one of the main objectives of artificial intelligence.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123497211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
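To make the idea of a flexible, fuzzy retrieval system in the abstract above more concrete, the sketch below scores documents against a Boolean query using the standard Zadeh operators (minimum for AND, maximum for OR, complement for NOT). The abstract does not say which model the paper uses, so this is only one common formulation; the index, its membership degrees, and all function names are hypothetical.

```python
def fuzzy_and(*degrees):
    """Fuzzy conjunction (Zadeh): the weakest condition limits the match."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy disjunction (Zadeh): the strongest condition determines the match."""
    return max(degrees)


def fuzzy_not(degree):
    """Fuzzy complement."""
    return 1.0 - degree


# Hypothetical index: term -> {document: degree in [0, 1] to which the term describes it}.
index = {
    "fuzzy": {"d1": 0.9, "d2": 0.4, "d3": 0.0},
    "retrieval": {"d1": 0.6, "d2": 0.8, "d3": 0.7},
    "database": {"d1": 0.1, "d2": 0.9, "d3": 0.2},
}


def weight(term, doc):
    """Membership degree of `doc` in the fuzzy set of documents about `term`."""
    return index.get(term, {}).get(doc, 0.0)


def score(doc):
    """Degree to which `doc` satisfies: fuzzy AND retrieval AND NOT database."""
    return fuzzy_and(
        weight("fuzzy", doc),
        weight("retrieval", doc),
        fuzzy_not(weight("database", doc)),
    )


ranking = sorted(["d1", "d2", "d3"], key=score, reverse=True)
for doc in ranking:
    print(doc, round(score(doc), 2))
```

Unlike a crisp Boolean system, which would only accept or reject each document, this scoring yields a graded ranking, so a document that partially matches the query is still retrievable.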
When clustering a large number of images using low-level features, one of the problems encountered is the high-dimensional feature space. The high dimensionality of the feature space leads to unnecessary cost in feature selection and in distance measurement during the clustering process. In this paper, we propose an approach to reduce the dimensionality of the feature space based on diffusion maps. In the proposed approach, each image is represented by a set of tiles. A visual keyword-image matrix is derived by classifying these tiles into a set of clusters and counting the occurrences of each cluster in each image of our database. The visual keyword-image matrix is similar to the term-document matrix in information retrieval. We use diffusion maps to reduce the dimensionality of the visual keyword-image matrix. By reducing the dimensionality of the image representation, we can significantly reduce computation cost. We compare the performance of the proposed approach with that of an approach that uses the global MPEG-7 color descriptors. The results demonstrate the improvement achieved by the proposed approach.
{"title":"Diffusion maps-based image clustering","authors":"R. Agrawal, C.-H. Wu, W. Grosky, F. Fotouhi","doi":"10.1145/1364742.1364754","DOIUrl":"https://doi.org/10.1145/1364742.1364754","url":null,"abstract":"In the clustering of large number of images using low-level features, one of the problems encountered is the high dimensional feature space. The high dimensionality of feature spaces leads to unnecessary cost in feature selection and also in the distance measurement during the clustering process. In this paper, we propose an approach to reduce the dimensionality of the feature space based on diffusion maps. In the proposed approach, each image is represented by a set of tiles. A visual keyword-image matrix is derived from classifying these tiles into a set of clusters and counting the occurrence of each cluster in each image of our database. The visual keyword-image matrix is similar to the term-document matrix in information retrieval. We use diffusion maps to reduce the dimensionality of visual keyword matrix. By reducing the dimensionality of the image representation, we can save computation cost significantly. We compare the performance between the proposed approach and the approach that uses the global MPEG-7 color descriptors. The results demonstrate the improvements.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126065183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
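The construction of the visual keyword-image matrix described in the abstract above can be sketched as follows. This is an illustrative sketch under several assumptions: the cluster centroids are taken as given (the abstract does not say how the tile clusters are obtained), tiles are tiny two-dimensional feature vectors, tiles are assigned by nearest centroid, and all names are hypothetical. The subsequent diffusion-map reduction is not shown.

```python
from collections import Counter


def nearest_keyword(tile, centroids):
    """Index of the centroid closest to a tile feature vector (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((t - c) ** 2 for t, c in zip(tile, centroids[k])),
    )


def keyword_image_matrix(images, centroids):
    """Rows are visual keywords (clusters), columns are images, entries are tile counts."""
    matrix = [[0] * len(images) for _ in centroids]
    for j, tiles in enumerate(images):
        counts = Counter(nearest_keyword(tile, centroids) for tile in tiles)
        for k, count in counts.items():
            matrix[k][j] = count
    return matrix


# Hypothetical data: three visual keywords and two images, each a list of tile features.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
images = [
    [(0.1, 0.0), (0.9, 1.0), (0.2, 0.1)],
    [(1.0, 0.9), (0.1, 0.9), (0.0, 1.1), (0.8, 0.8)],
]
matrix = keyword_image_matrix(images, centroids)
for row in matrix:
    print(row)
```

Each column plays the role a document's term-count vector plays in a term-document matrix, which is the representation whose dimensionality the paper then reduces.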
We examine some research issues in pattern recognition and image processing that have been spurred by the needs of digital libraries. Broader -- and not only linguistic -- context must be introduced in character recognition on low-contrast, tightly set documents because the conversion of documents to coded (searchable) form is lagging far behind conversion to image formats. At the same time, the prevalence of imaged documents over coded documents gives rise to interesting research problems in interactive annotation of document images. At the level of circulation, reformatting document images to accommodate diverse user needs remains a challenge.
{"title":"Digitizing, coding, annotating, disseminating, and preserving documents","authors":"G. Nagy","doi":"10.1145/1364742.1364757","DOIUrl":"https://doi.org/10.1145/1364742.1364757","url":null,"abstract":"We examine some research issues in pattern recognition and image processing that have been spurred by the needs of digital libraries. Broader -- and not only linguistic -- context must be introduced in character recognition on low-contrast, tightly-set documents because the conversion of documents to coded (searchable) form is lagging far behind conversion to image formats. At the same time, the prevalence of imaged documents over coded documents gives rise to interesting research problems in interactive annotation of document images. At the level of circulation, reformatting document images to accommodate diverse user needs remains a challenge.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"11 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129480720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}