Cleaning Collections Data Using OpenRefine
Elizabeth Sterner
Issues in Science and Technology Librarianship, published 2019-08-29
DOI: 10.29173/istl30 (https://doi.org/10.29173/istl30)
Citations: 4
Abstract
Collection maintenance, including weeding, is a key component of my position as an academic science librarian. In an ideal world, the data we receive would be clean and ready to use; unfortunately, that is not always the case. In large deselection projects, you might receive holdings and circulation records in separate files which, once combined, may contain many unwanted duplicate line items. I will demonstrate how to use the facet row feature in OpenRefine to deduplicate data quickly and effectively. The benefit of this method is that you choose which of the duplicated items to keep and which to delete. Once OpenRefine is downloaded and opened, you work in a web-based user interface to upload your data, clean and transform it, and then export the result from the browser as a CSV file that can be opened in Excel. With practice, I have found that this takes only a few minutes for thousands of line items and ensures that I keep exactly the data I want.
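For readers who want to see the underlying logic outside OpenRefine, the following is a minimal Python sketch of the same idea: flag every row whose key value occurs more than once, then keep one chosen row per key. It is not OpenRefine's API and not the article's own procedure. The column names (`barcode`, `title`, `checkouts`) and the rule "keep the row with the most checkouts" are assumptions made purely for illustration.

```python
from collections import Counter


def flag_duplicates(rows, key):
    """Return one boolean per row: True if the row's key value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] > 1 for row in rows]


def keep_preferred(rows, key, score):
    """Keep one row per key value: the row with the highest score (first row wins ties)."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or score(row) > score(best[k]):
            best[k] = row
    return list(best.values())


# Hypothetical combined holdings + circulation records (column names are assumptions).
rows = [
    {"barcode": "A1", "title": "Organic Chemistry", "checkouts": "0"},
    {"barcode": "A1", "title": "Organic Chemistry", "checkouts": "12"},
    {"barcode": "B2", "title": "Physical Geology", "checkouts": "3"},
]

# Step 1: mark which rows belong to a duplicated group, so they can be reviewed.
flags = flag_duplicates(rows, "barcode")

# Step 2: keep the preferred row in each group; here, the one with the most checkouts.
deduped = keep_preferred(rows, "barcode", lambda r: int(r["checkouts"]))
```

The point of separating the two steps is the one made in the abstract: duplicates are surfaced first, and the decision about which copy survives is an explicit rule you choose rather than an arbitrary "drop all but the first".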