{
  "title": "Authorship Identification for Online Text",
  "authors": "Richmond Hong Rui Tan, F. S. Tsai",
  "doi": "10.1109/CW.2010.50",
  "journal": {"name": "2010 International Conference on Cyberworlds"},
  "publicationDate": "2010-10-20",
  "publicationTypes": "Journal Article",
  "isOpenAccess": false,
  "citationCount": "26",
  "platform": "Semanticscholar",
  "ListUrlMain": "https://doi.org/10.1109/CW.2010.50"
}
Authorship identification for online text such as blogs and e-books is a challenging problem because these documents contain relatively little content. Identification is therefore much harder than for longer documents such as books and reports. The paper investigates which features are suitable for such texts and the classifier accuracy they yield. Syntactic features are found to work well for large data sets, whereas lexical features work well for small data sets. The results can be used to customize and further improve authorship detection techniques according to the characteristics of the writing samples.
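To make the distinction between the two feature families concrete, here is a minimal sketch of how lexical and syntactic features might be extracted from a short text. The abstract does not list the paper's actual features, so everything here is an assumption made for illustration: the function names, the short `FUNCTION_WORDS` list, and the choice of function-word and punctuation rates as stand-ins for syntactic features (a common convention in stylometry).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; the paper's real feature set is not
# given in the abstract.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")
PUNCTUATION = ",.;:!?"


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_features(text):
    """Word-level statistics: average word length and type-token ratio."""
    words = tokenize(text)
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
    }


def syntactic_features(text):
    """Relative frequencies of function words and punctuation marks."""
    words = tokenize(text)
    n = len(words) or 1
    counts = Counter(words)
    feats = {f"fw_{w}": counts[w] / n for w in FUNCTION_WORDS}
    chars = len(text) or 1
    feats.update({f"punct_{p}": text.count(p) / chars for p in PUNCTUATION})
    return feats


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    print(lexical_features(sample))
    print(syntactic_features(sample))
```

Either dictionary could be turned into a fixed-order vector and passed to any standard classifier; which family performs better for a given corpus size is the question the paper studies, and this sketch does not attempt to reproduce its results.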