Andrew Yates, Daniel S. Dotson, Stephanie J. Schulte, R. Ramnath
Journal of Library Metadata, volume 46 1, pages 113-134. Published 2018-10-02. DOI: 10.1080/19386389.2018.1538610 (https://doi.org/10.1080/19386389.2018.1538610)
Hierarchical Categorization of Big Content Using Concept Topology
Abstract: Methods that are both computationally feasible and practically effective are needed to make sense of big corpuses of content, or “big content.” For example, supervised categorization techniques for open-access academic publishing are ill-suited for automated categorization because they rely on an existing categorization scheme, but no supervised scheme can stay abreast of the rapidly evolving landscape of scholarly work. This problem also applies to any domain with very large document corpuses where no good categorization scheme exists. To address this challenge, we present an unsupervised method to fit a hierarchical categorization scheme to a corpus based on clustering the network of shared concepts in the corpus, or its “concept topology.” Our method potentially applies to any type of content, and it scales to large networks of millions of vertices. We have demonstrated the application of our method to a corpus of 1.5 million scholarly texts representing the majority of open access (OA) academic publications on the web, validating our results using expert librarian annotations. We have made our datasets openly accessible for research by others. We believe that our resulting categorization scheme best represents OA academic publishing as it exists today.
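
The abstract does not say which clustering algorithm is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that vertices are documents, that an edge's weight is the number of concepts two documents share, and that a hierarchy can be obtained by repeatedly raising a shared-concept threshold and splitting each group into connected components. The function names (`build_graph`, `components`, `hierarchy`) and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(doc_concepts):
    """Weight each document pair by the number of concepts they share."""
    by_concept = defaultdict(set)
    for doc, concepts in doc_concepts.items():
        for concept in concepts:
            by_concept[concept].add(doc)
    weights = defaultdict(int)
    for docs in by_concept.values():
        # Keys are ordered pairs (a, b) with a < b.
        for a, b in combinations(sorted(docs), 2):
            weights[(a, b)] += 1
    return dict(weights)


def components(nodes, weights, threshold):
    """Connected components using only edges with weight >= threshold."""
    adjacency = {n: set() for n in nodes}
    for (a, b), w in weights.items():
        if w >= threshold and a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        result.append(group)
    return result


def hierarchy(nodes, weights, threshold=1):
    """Nested lists of documents: split groups by raising the threshold."""
    nodes = set(nodes)
    parts = components(nodes, weights, threshold)
    if len(parts) == len(nodes):
        # Nothing is linked at this threshold, so treat the group as a leaf.
        return sorted(nodes)
    if len(parts) == 1:
        # Still one group: demand more shared concepts before splitting.
        return hierarchy(nodes, weights, threshold + 1)
    return [hierarchy(part, weights, threshold + 1) for part in parts]


# Hypothetical toy corpus: document id -> set of extracted concepts.
docs = {
    "d1": {"graph", "clustering", "network"},
    "d2": {"graph", "clustering", "corpus"},
    "d3": {"metadata", "library", "catalog"},
    "d4": {"metadata", "library", "network"},
}
print(hierarchy(docs, build_graph(docs)))
```

In this toy corpus all four documents are linked when one shared concept is enough, and they separate into two groups once two shared concepts are required. The pairwise edge construction is quadratic in the number of documents per concept, so a corpus of the size described in the abstract would need a far more scalable graph representation and clustering method than this sketch.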