Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses
Authors: Maria Diaz, Anand Panangadan
Published in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI 2020), virtual conference, 11-13 August 2020, pp. 259-264
DOI: https://doi.org/10.1109/iri49571.2020.00044
Publication date: 2020-08-01
Citation count: 9
Abstract
There is increasing interest in automatically identifying advertisements related to sex trafficking on online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small, high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier, which is then used to identify suspicious activity in the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances in a large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447, yielding a compact classifier with only a small drop in accuracy.
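The first two steps described in the abstract, integrating the two datasets and building a TF-IDF weighted document-term matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. The business names, review texts, the name-matching rule in `label_reviews`, and the specific TF-IDF variant (raw term count multiplied by ln(N / df)) are all assumptions made for illustration; the abstract does not state how businesses were matched or which weighting formula was used.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: names and texts are invented for illustration only.
# `known_flagged` stands in for the small high-precision list of businesses.
known_flagged = {"spa a"}
reviews = [
    {"business": "Spa A", "text": "Great massage, very discreet staff"},
    {"business": "Cafe B", "text": "Great coffee and friendly staff"},
    {"business": "Spa C", "text": "Relaxing massage and clean rooms"},
]


def normalize(name):
    """Lowercase and collapse whitespace so names can be compared (assumed rule)."""
    return " ".join(name.lower().split())


def label_reviews(reviews, known_flagged):
    """Mark each review True if its business appears in the known list."""
    return [normalize(r["business"]) in known_flagged for r in reviews]


def tokenize(text):
    """Very simple pre-processing: lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Build a document-term matrix weighted by count * ln(N / df).

    This is one common TF-IDF variant, chosen here as an assumption.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


labels = label_reviews(reviews, known_flagged)
vocab, matrix = tfidf_matrix([r["text"] for r in reviews])
```

In a full pipeline, `matrix` and `labels` would be passed to a classifier such as a random forest, and the trained model would then score the reviews that were not matched against the known list. That step is left out here to keep the sketch dependency-free.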