String Kernels for Document Classification: A Comparative Study
Nikhil V. Chandran, A. S., A. V. S.
2022 International Conference on Innovative Trends in Information Technology (ICITIIT), 2022-02-12
https://doi.org/10.1109/ICITIIT54346.2022.9744134
In machine learning and data mining, String Kernels combined with classifiers such as Support Vector Machines (SVM) show state-of-the-art results for tasks such as text classification. Traditional pairwise comparison of strings is computationally expensive on large datasets, because the number of comparisons grows quadratically with the number of strings. This work compares the performance of several String Kernels and similarity measures on the document classification task: the Spectrum Kernel, the String Subsequence Kernel, the Weighted Degree Kernel, and the Distance Substitution Kernel. A detailed comparative study of these Kernel techniques on a real-life document corpus, Reuters-21578, yields different insights depending on whether they are used with or without other feature extraction techniques. The results indicate that string similarity measures give the best performance when run over the entire corpus, but only for small and medium-sized datasets, since the complexity increases with the size of the dataset.
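As a rough illustration of the kind of kernel being compared, the following is a minimal sketch of a normalized k-spectrum kernel and the pairwise Gram matrix it produces. It is not the authors' implementation: the function names, the choice of character-level k-mers, the default k = 3, and the cosine normalization are all assumptions made for this example. The nested loop in `gram_matrix` makes the quadratic number of pairwise comparisons explicit.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text, k=3):
    """Count every contiguous character substring of length k (the k-spectrum)."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Normalized k-spectrum kernel: cosine similarity of the k-mer count vectors.

    The normalization is an assumption of this sketch, not taken from the paper.
    """
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Inner product over shared k-mers; a Counter returns 0 for missing keys.
    dot = sum(count * fb[kmer] for kmer, count in fa.items())
    norm = sqrt(sum(c * c for c in fa.values()) * sum(c * c for c in fb.values()))
    return dot / norm if norm else 0.0


def gram_matrix(docs, k=3):
    """Pairwise kernel matrix; needs n * (n + 1) / 2 kernel evaluations."""
    n = len(docs)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The kernel is symmetric, so fill both triangles at once.
            gram[i][j] = gram[j][i] = spectrum_kernel(docs[i], docs[j], k)
    return gram


if __name__ == "__main__":
    # Hypothetical toy documents, not drawn from Reuters-21578.
    docs = ["wheat exports rose", "wheat export prices fell", "bank raises interest rates"]
    for row in gram_matrix(docs):
        print([round(value, 3) for value in row])
```

A matrix of this form could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Whether the paper used that setup is not stated in the abstract.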