Selecting Feature-Words in Tag Sense Disambiguation Based on Their Shapley Value
Meshesha Legesse, G. Gianini, Dereje Teferi
2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS)
DOI: 10.1109/SITIS.2016.45
In tag-word disambiguation, a word is assigned to one specific context chosen from the several contexts to which it is related. Relatedness to a context is often defined by the co-occurrence of the target word with other words (context words) in the sentences of a given corpus. The overall disambiguation process can be thought of as a classification task in which the context words play the role of features for the target. A problem with this approach is that the large number of possible context words can degrade classification, both by increasing the computational effort and by lowering the quality of the outcome. Feature selection can improve the process in both regards by reducing the feature space to a manageable size while retaining high information content. In this work we propose a feature selection approach for disambiguation based on the Shapley Value (SV), a metric from coalitional game theory that measures the importance of a component within a coalition. By including in the feature set only the words with the highest Shapley Value, we obtain remarkable improvements in both quality and performance. The exponential complexity of the exact SV computation is avoided through an approximate computation based on sampling. We demonstrate the effectiveness of the method, and of the sampling-based approximation, on both a synthetic language corpus and a real-world linguistic corpus.
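
The sampling idea can be illustrated with permutation sampling, a common way to approximate Shapley values: draw random orderings of the context words, add the words one at a time, and average each word's marginal contribution to the value of the growing coalition. The abstract does not say which value function or sampling scheme the authors use, so everything below is an assumption made for illustration: the toy corpus for the tag "bank", the majority-vote `accuracy` value function, and all function names are hypothetical. It is a minimal sketch, not the paper's implementation.

```python
import random
from collections import Counter
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Tuple

# Hypothetical toy corpus for the ambiguous tag "bank":
# (context words in the sentence, correct sense).
CORPUS: List[Tuple[FrozenSet[str], str]] = [
    (frozenset({"river", "water", "the"}), "shore"),
    (frozenset({"river", "the"}), "shore"),
    (frozenset({"money", "loan", "the"}), "finance"),
    (frozenset({"money", "the"}), "finance"),
]


def strict_winner(tally: Counter) -> Optional[str]:
    """Return the label with strictly the most votes, or None on a tie or empty tally."""
    ranked = tally.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def word_votes(corpus: Sequence[Tuple[FrozenSet[str], str]]) -> Dict[str, str]:
    """Map each context word to the sense it co-occurs with most often."""
    counts: Dict[str, Counter] = {}
    for words, sense in corpus:
        for w in words:
            counts.setdefault(w, Counter())[sense] += 1
    votes: Dict[str, str] = {}
    for w, c in counts.items():
        winner = strict_winner(c)
        if winner is not None:  # words tied between senses cast no vote
            votes[w] = winner
    return votes


VOTES = word_votes(CORPUS)
FEATURES = sorted({w for words, _ in CORPUS for w in words})


def accuracy(coalition: FrozenSet[str]) -> float:
    """Assumed value function: accuracy of majority voting restricted to `coalition`."""
    correct = 0
    for words, sense in CORPUS:
        tally = Counter(VOTES[w] for w in words & coalition if w in VOTES)
        if strict_winner(tally) == sense:
            correct += 1
    return correct / len(CORPUS)


def sampled_shapley(
    features: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    for _ in range(n_permutations):
        order = list(features)
        rng.shuffle(order)
        coalition: set = set()
        prev = value(frozenset(coalition))
        for f in order:
            coalition.add(f)
            cur = value(frozenset(coalition))
            totals[f] += cur - prev  # marginal contribution of f in this ordering
            prev = cur
    return {f: t / n_permutations for f, t in totals.items()}


phi = sampled_shapley(FEATURES, accuracy)
# Keep only the k highest-valued context words as features (k = 2 here).
selected = sorted(phi, key=phi.get, reverse=True)[:2]
print(phi)
print(selected)
```

Each sampled ordering costs one value-function call per feature, so the total cost grows with the number of permutations times the number of words, rather than with the number of subsets of words. In this toy setting the uninformative word "the" never changes the accuracy and so receives a value of zero, while the words that disambiguate a sentence on their own rank highest.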