Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses.

Maria Diaz, Anand Panangadan
{"title":"Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses.","authors":"Maria Diaz,&nbsp;Anand Panangadan","doi":"10.1109/iri49571.2020.00044","DOIUrl":null,"url":null,"abstract":"<p><p>There is increasing interest in automatically identifying advertisements related to sex trafficking in online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small high precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews and converted into a bag-of-words model with Term frequency-inverse document weighting. The resulting Document-Term matrix is used to train a classifier and then to identify suspicious activity from the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances from the large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different number of text features. The method achieved a f1-score of 0.77 with a random forests classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.</p>","PeriodicalId":93159,"journal":{"name":"2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science : IRI 2020 : proceedings : virtual conference, 11-13 August 2020. IEEE International Conference on Information Reuse and Integration (21st : 2...","volume":"2020 ","pages":"259-264"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/iri49571.2020.00044","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science : IRI 2020 : proceedings : virtual conference, 11-13 August 2020. IEEE International Conference on Information Reuse and Integration (21st : 2...","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/iri49571.2020.00044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2020/9/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

There is increasing interest in automatically identifying advertisements related to sex trafficking in online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small high precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews and converted into a bag-of-words model with Term frequency-inverse document weighting. The resulting Document-Term matrix is used to train a classifier and then to identify suspicious activity from the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances from the large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different number of text features. The method achieved a f1-score of 0.77 with a random forests classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于自然语言的在线评论数据集集成,用于识别性交易业务。
人们对自动识别在线评论网站上与性交易有关的广告越来越感兴趣。主要的挑战是识别用于指示非法业务的文本评论中的变化模式。这项工作描述了一种使用自然语言处理和机器学习识别非法商业广告的新方法。该方法依赖于建立一个对已知非法业务进行审查的训练集。这个训练数据是通过整合一个小的高精度的已知非法企业(Rubmaps)和一个通用评论网站(Yelp)的大量在线评论来创建的。然后将标准的自然语言预处理技术应用于文本评审,并将其转换为具有词频率逆文档权重的词袋模型。生成的Document-Term矩阵用于训练分类器,然后从剩余的评论中识别可疑活动。因此,这种方法利用高精度、低召回率的数据集,从大型低精度、高召回率的数据集中识别相关实例。该方法在来自Yelp在线论坛的456,050条评论上进行了评估,使用了各种机器学习算法和不同数量的文本特征。使用随机森林分类器,该方法的f1得分为0.77。对于一个紧凑的分类器,文本特征的数量也可以从1473减少到447,而准确率只有很小的下降。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Attention-Guided Generative Adversarial Network to Address Atypical Anatomy in Synthetic CT Generation. Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses. An Adaptive and Dynamic Biosensor Epidemic Model for COVID-19 Relating the Empirical Foundations of Attack Generation and Vulnerability Discovery Latent Feature Modelling for Recommender Systems
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1