Data structures for information retrieval
D. L. Nkweteyim
2014 IST-Africa Conference Proceedings, published 2014-05-07
DOI: 10.1109/ISTAFRICA.2014.6880643 (https://doi.org/10.1109/ISTAFRICA.2014.6880643)
Citations: 2
Abstract
Efficiently indexing large document collections for information retrieval places heavy demands on a computer's memory and processor, and requires judicious use of both resources. In this paper, we describe our approach to constructing such an index based on the vector-space model (VSM). We review the stages involved in generating an index, weighting the index terms, and representing documents in the VSM. We explain our choice of data structures at each step, from parsing the document collection through generating index terms to generating document representations, and discuss the tradeoffs involved in those choices. We then demonstrate the approach using the OHSUMED data set. Our results show that even with only a modest amount of main memory (4 GB), large data sets such as OHSUMED can be indexed quickly.
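The stages named in the abstract (parsing, index-term generation, term weighting, document representation) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the data structures or the weighting scheme, so the hash-map postings, the `tokenize` and `build_index` helpers, and the tf × log(N/df) weight used here are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical parser: lowercase, then split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(docs):
    """Build an inverted index and sparse VSM document vectors.

    docs: dict mapping doc_id -> raw text.
    Returns (postings, doc_vectors), where
      postings[term][doc_id]    = raw term frequency, and
      doc_vectors[doc_id][term] = tf * log(N / df)  (assumed weighting).
    """
    # Stage 1 and 2: parse each document and record term frequencies.
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf

    # Stage 3 and 4: weight each term and store one sparse vector per document.
    n_docs = len(docs)
    doc_vectors = defaultdict(dict)
    for term, plist in postings.items():
        idf = math.log(n_docs / len(plist))  # df = number of docs containing term
        for doc_id, tf in plist.items():
            doc_vectors[doc_id][term] = tf * idf
    return dict(postings), dict(doc_vectors)


if __name__ == "__main__":
    sample = {1: "heart attack risk", 2: "heart surgery"}
    postings, vectors = build_index(sample)
    print(postings["heart"])
    print(vectors[1])
```

Storing only nonzero entries per document keeps memory proportional to the number of distinct (term, document) pairs rather than to vocabulary size times collection size, which is one common way to keep a VSM index within a limited memory budget.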