Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig
{"title":"sportstable:一个新的语义类型检测语料库(扩展版)","authors":"Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig","doi":"10.1007/s13222-023-00457-y","DOIUrl":null,"url":null,"abstract":"Abstract Table corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between corpora and real-world data lakes since they contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show in this extended version paper of [18] the results of an extensive study using four different state-of-the-art approaches for semantic type detection on our new corpus. Overall, the results demonstrate significant performance differences in predicting semantic types for textual and numerical data.","PeriodicalId":72771,"journal":{"name":"Datenbank-Spektrum : Zeitschrift fur Datenbanktechnologie : Organ der Fachgruppe Datenbanken der Gesellschaft fur Informatik e.V","volume":"223 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SportsTables: A New Corpus for Semantic Type Detection (Extended Version)\",\"authors\":\"Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig\",\"doi\":\"10.1007/s13222-023-00457-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Table corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between corpora and real-world data lakes since they contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show in this extended version paper of [18] the results of an extensive study using four different state-of-the-art approaches for semantic type detection on our new corpus. Overall, the results demonstrate significant performance differences in predicting semantic types for textual and numerical data.\",\"PeriodicalId\":72771,\"journal\":{\"name\":\"Datenbank-Spektrum : Zeitschrift fur Datenbanktechnologie : Organ der Fachgruppe Datenbanken der Gesellschaft fur Informatik e.V\",\"volume\":\"223 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Datenbank-Spektrum : Zeitschrift fur Datenbanktechnologie : Organ der Fachgruppe Datenbanken der Gesellschaft fur Informatik e.V\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s13222-023-00457-y\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Datenbank-Spektrum : Zeitschrift fur Datenbanktechnologie : Organ der Fachgruppe Datenbanken der Gesellschaft fur Informatik e.V","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s13222-023-00457-y","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SportsTables: A New Corpus for Semantic Type Detection (Extended Version)
Abstract Table corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between corpora and real-world data lakes since they contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show in this extended version paper of [18] the results of an extensive study using four different state-of-the-art approaches for semantic type detection on our new corpus. Overall, the results demonstrate significant performance differences in predicting semantic types for textual and numerical data.