Generating Labeled Datasets of Twitter Users
Yasen Kiprov, Pepa Gencheva, Ivan Koychev
Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, 2017-07-09
https://doi.org/10.1145/3099023.3099048
In this paper we present a simple yet powerful approach to generating labeled datasets of Twitter users. We focus on sensitive personal details that are shared as background information in tweets. Because such details are not the focus of the user's attention, these tweets tend to be less affected by the humor, wishes, and hypothetical thinking typical of tweets. Our approach combines the selection of search queries with a subsequent semi-supervised filtering of indicative messages. We create datasets in several unrelated domains and show that a wide range of target groups can be built with minimal manual annotation effort. The generated datasets include separate groups of users with specific characteristics: pet ownership, blood pressure, diabetes, and psychotropic medicine usage, for which, to our knowledge, manually labeled data was not previously available. Our search-based approach is also used to generate a cross-domain corpus that matches Twitter users with their Yelp profiles.
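The two stages named above, query-based selection followed by semi-supervised filtering, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the queries, the features, or the filtering algorithm, so the sketch substitutes a simple word-count self-training loop over an in-memory list of invented tweets. The names `search`, `self_train`, `TWEETS`, and the `margin` parameter are hypothetical and do not come from the paper.

```python
import re
from collections import Counter

# Invented toy data; a real run would fetch tweets through a search API.
TWEETS = [
    {"id": 1, "user": "a", "text": "taking my dog to the vet before work"},
    {"id": 2, "user": "b", "text": "i wish my dog was real lol"},
    {"id": 3, "user": "c", "text": "late for work, had to walk my dog to the vet first"},
    {"id": 4, "user": "d", "text": "lol i wish my dog was a dragon"},
    {"id": 5, "user": "e", "text": "my cat is asleep"},
]


def search(tweets, query):
    """Stage 1 (assumed): keep tweets that contain every query term as a substring."""
    terms = query.lower().split()
    return [t for t in tweets if all(term in t["text"].lower() for term in terms)]


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def self_train(candidates, seed_labels, rounds=3, margin=2):
    """Stage 2 (assumed): grow a small seed labeling by self-training.

    Each round counts words in tweets labeled indicative (True) and
    non-indicative (False), scores every unlabeled tweet by the difference
    of those counts, and labels only tweets whose absolute score reaches
    `margin`.
    """
    labels = dict(seed_labels)  # tweet id -> True / False
    for _ in range(rounds):
        pos, neg = Counter(), Counter()
        for t in candidates:
            if t["id"] in labels:
                (pos if labels[t["id"]] else neg).update(tokens(t["text"]))
        added = False
        for t in candidates:
            if t["id"] in labels:
                continue
            score = sum(pos[w] - neg[w] for w in tokens(t["text"]))
            if abs(score) >= margin:
                labels[t["id"]] = score > 0
                added = True
        if not added:  # nothing confident left to label
            break
    return labels


candidates = search(TWEETS, "my dog")
# Two manually annotated seeds: tweet 1 is indicative, tweet 2 is not.
labels = self_train(candidates, {1: True, 2: False})
pet_owners = {t["user"] for t in candidates if labels.get(t["id"])}
print(sorted(pet_owners))
```

The manual effort in this sketch is limited to the two seed labels; the remaining candidates are labeled automatically or left unlabeled when the score is not decisive. A real pipeline would likely replace the word-count score with a trained classifier.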