Weizhong Zhao, M. VenkataSwamy, Gang Chen, Xiaowei Xu
2013 International Conference on Social Computing. Published 2013-09-08. DOI: https://doi.org/10.1109/SocialCom.2013.147
Fast Information Retrieval and Social Network Mining via Cosine Similarity Upper Bound
Similarity search is a key function in many applications, including databases, pattern recognition, and recommendation systems. In this paper, we first propose the ε-query, a similarity search based on the popular cosine similarity for information retrieval and social network analysis. In contrast to traditional similarity search, an ε-query returns the results whose cosine similarity with the query is larger than a threshold ε. The major contribution of this paper is an efficient ε-query processing algorithm that uses an upper bound on cosine similarity for binary data. Our evaluation on two of the largest publicly available real datasets, ClueWeb09 and Twitter, demonstrated that the proposed method could achieve a speedup of several orders of magnitude over the traditional approach. Finally, we applied the proposed method to information retrieval from ClueWeb and to finding community structures in Twitter. The outcomes further demonstrated the effectiveness of the proposed method.
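To make the idea of an ε-query with upper-bound pruning concrete, here is a minimal sketch in Python. The abstract does not state the bound the authors use, so the sketch assumes a simple, well-known size-based bound for binary vectors: since |a ∩ b| ≤ min(|a|, |b|), the cosine similarity |a ∩ b| / sqrt(|a|·|b|) cannot exceed sqrt(min(|a|, |b|) / max(|a|, |b|)). The function names (`cosine_binary`, `size_upper_bound`, `epsilon_query`) and the set-of-indices representation are hypothetical choices for illustration, not the paper's algorithm.

```python
from math import sqrt


def cosine_binary(a: frozenset, b: frozenset) -> float:
    """Cosine similarity of two binary vectors given as sets of nonzero indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def size_upper_bound(na: int, nb: int) -> float:
    """Assumed bound (not taken from the paper): cosine <= sqrt(min / max) of the set sizes."""
    if na == 0 or nb == 0:
        return 0.0
    return sqrt(min(na, nb) / max(na, nb))


def epsilon_query(query: frozenset, data: list, eps: float):
    """Return ([(index, similarity)], skipped) for items with cosine > eps.

    An item is skipped without computing the intersection whenever the
    size-based upper bound already shows its cosine cannot exceed eps.
    """
    results = []
    skipped = 0
    for i, item in enumerate(data):
        if size_upper_bound(len(query), len(item)) <= eps:
            skipped += 1  # pruned: the exact similarity is never computed
            continue
        sim = cosine_binary(query, item)
        if sim > eps:
            results.append((i, sim))
    return results, skipped


# Toy data: each item is the set of nonzero positions of a binary vector.
query = frozenset({1, 2, 3, 4})
data = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1}),
    frozenset(range(40)),
    frozenset({7, 8, 9, 10}),
]
results, skipped = epsilon_query(query, data, eps=0.5)
```

In this toy run the two items whose sizes differ greatly from the query's are pruned by the bound alone, and the remaining three are compared exactly. Because the bound needs only the set sizes, the check costs constant time per item, which is where the savings over a full scan would come from under this assumption.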