{"title":"使用电子表格发现企业概念","authors":"Keqian Li, Yeye He, Kris Ganjam","doi":"10.1145/3097983.3098102","DOIUrl":null,"url":null,"abstract":"Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then \"summarize\" the large tree by selecting important tree nodes that are likely good concepts based on how well they \"describe\" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Discovering Enterprise Concepts Using Spreadsheet Tables\",\"authors\":\"Keqian Li, Yeye He, Kris Ganjam\",\"doi\":\"10.1145/3097983.3098102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then \\\"summarize\\\" the large tree by selecting important tree nodes that are likely good concepts based on how well they \\\"describe\\\" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.\",\"PeriodicalId\":314049,\"journal\":{\"name\":\"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3097983.3098102\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3097983.3098102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Discovering Enterprise Concepts Using Spreadsheet Tables
Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then "summarize" the large tree by selecting important tree nodes that are likely good concepts based on how well they "describe" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.