Automatic Labeling of Training Data for Collecting Tweets for Ambiguous TV Program Titles

2013 International Conference on Social Computing
Pub Date: 2013-09-08
DOI: 10.1109/SocialCom.2013.119

M. Erdmann, Erik Ward, K. Ikeda, Gen Hattori, C. Ono, Y. Takishima

Citations: 5

Abstract

Twitter is a popular medium for sharing opinions on TV programs, and the analysis of TV related tweets is attracting a lot of interest. However, when collecting all tweets containing a given TV program title, we obtain a large number of unrelated tweets, due to the fact that many of the TV program titles are ambiguous. Using supervised learning, TV related tweets can be collected with high accuracy. The goal of our proposed method is to automate the labeling process, in order to eliminate the cost required for data labeling without sacrificing classification accuracy. When creating the training data, we use only tweets of unambiguous TV program titles. In order to decide whether a TV program title is ambiguous, we automatically determine whether it can be used as a common expression or named entity. In two experiments, in which we collected tweets for 32 ambiguous TV program titles, we achieved the same (78.2%) or even higher classification accuracy (79.1%) with automatically labeled training data as with manually labeled data, while effectively eliminating labeling costs.
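The labeling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `is_ambiguous`, `auto_label`, and `LabeledTweet` are hypothetical, the lookup sets `common_expressions` and `named_entities` stand in for whatever resources the paper actually uses to detect ambiguity, and title matching is reduced to a case-insensitive substring test. The abstract does not say where negative examples come from, so treating tweets that mention no title at all as negatives is purely an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledTweet:
    text: str
    tv_related: bool


def is_ambiguous(title, common_expressions, named_entities):
    """Treat a title as ambiguous if it also occurs as a common
    expression or as a named entity (hypothetical lookup sets)."""
    key = title.lower()
    return key in common_expressions or key in named_entities


def auto_label(tweets, titles, common_expressions, named_entities):
    """Label tweets without manual annotation.

    Positive: the tweet mentions at least one unambiguous title.
    Negative: the tweet mentions no title at all (an assumption of
    this sketch, not something stated in the abstract).
    Tweets that mention only ambiguous titles stay unlabeled; they
    are what a trained classifier would later have to decide on.
    """
    all_titles = [t.lower() for t in titles]
    unambiguous = [
        t.lower()
        for t in titles
        if not is_ambiguous(t, common_expressions, named_entities)
    ]
    labeled = []
    for tweet in tweets:
        text = tweet.lower()
        if any(title in text for title in unambiguous):
            labeled.append(LabeledTweet(tweet, True))
        elif not any(title in text for title in all_titles):
            labeled.append(LabeledTweet(tweet, False))
    return labeled
```

The labeled tweets would then serve as training data for a supervised classifier that is applied to tweets mentioning the ambiguous titles.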
