Detecting Sensitive Content in Spoken Language
Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi
Spoken language can include sensitive topics, including profanity, insults, and political and offensive speech. In order to engage in contextually appropriate conversations, it is essential for voice services such as Alexa, Google Assistant, and Siri to detect sensitive topics in conversations and react appropriately. A simple approach to detect sensitive topics is to use regular expressions or keyword-based rules. However, keyword-based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale or generalize well even for minor variations of the keywords. Machine learning (ML) approaches offer the potential benefit of generalization, but require large volumes of training data, which is difficult to obtain for sparse data problems. This paper describes: (1) an ML-based solution that uses training data (a 2.1M-example dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (well above 90%). Moreover, in spite of relying on synthetic training data, the ML models generalize beyond the training data and identify significantly larger portions of the test stream as sensitive (~2x for Logistic Regression and ~4x-6x for neural network models such as Bi-LSTM and CNN) compared to a baseline approach that uses the training data (~1 million examples) as rules. We are able to train our models with very few manual annotations: sensitive examples in our training dataset come from synthetic generation using templates (98.04%) and manual annotations (1.96%), while non-sensitive examples come from synthetic generation using templates (15.35%), automated labeling via semi-supervised techniques (83.75%), and manual annotations (0.90%). The neural network models (Bi-LSTM and CNN) also have a lower memory footprint (22.5% lower than the baseline and 80% lower than Logistic Regression) while giving improved accuracy.
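As a minimal, hedged sketch of the contrast drawn in the abstract above between keyword rules and ML models (this is not the authors' system; the keyword list, toy training data, and model choice are illustrative placeholders), the snippet below shows how a character n-gram classifier can flag a spelling variant that an exact keyword rule misses:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical placeholder keyword list; a real rule set would be far larger.
SENSITIVE_KEYWORDS = ["badword", "insultword"]
rule = re.compile(r"\b(" + "|".join(map(re.escape, SENSITIVE_KEYWORDS)) + r")\b", re.IGNORECASE)

def rule_based(utterance: str) -> bool:
    """Flags an utterance only when it contains an exact keyword match."""
    return bool(rule.search(utterance))

# Toy training data; a character n-gram model can generalize to spelling variants.
train_texts = ["you are such a badword", "tell me an insultword joke",
               "play some music", "what is the weather"]
train_labels = [1, 1, 0, 0]
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

query = "you are such a badw0rd"   # keyword variant with a zero instead of 'o'
print(rule_based(query))           # False: the exact-match rule misses the variant
print(clf.predict([query]))        # the n-gram model may still flag it
```

The point of the sketch is only the recall gap noted in the abstract: exact-match rules fail on minor keyword variations, whereas a learned model trained on sub-word features has a chance to generalize to them.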
{"title":"Detecting Sensitive Content in Spoken Language","authors":"Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi","doi":"10.1109/DSAA.2019.00052","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00052","url":null,"abstract":"Spoken language can include sensitive topics including profanity, insults, political and offensive speech. In order to engage in contextually appropriate conversations, it is essential for voice services such as Alexa, Google Assistant, Siri, etc. to detect sensitive topics in the conversations and react appropriately. A simple approach to detect sensitive topics is to use regular expression or keyword based rules. However, keyword based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale and generalize well even for minor variations of the keywords. Machine learning (ML) approaches provide the potential benefit of generalization, but require large volumes of training data, which is difficult to obtain for sparse data problems. This paper describes: (1) a ML based solution that uses training data (2.1M dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>>90%). Moreover, in spite of relying on synthetic training data, the ML models are able to generalize beyond the training data to identify significantly higher amounts (~2x for Logistic Regression, and ~4x-6x for a Neural Network models such as Bi-LSTM and CNN) of the test stream as sensitive in comparison to a baseline approach using the training data (~ 1 Million examples) as rules. We are able to train our models with very few manual annotations. The percentage share of sensitive examples in our training dataset from synthetic generation using templates and manual annotations are 98.04% and 1.96%, respectively. The percentage share of non-sensitive examples in our training dataset from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations are 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also use lower memory footprint (22.5% lower than baseline and 80% lower than Logistic Regression) while giving improved accuracy.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114540093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automating Big Data Analysis Based on Deep Learning Generation by Automatic Service Composition
Incheon Paik, T. Siriweera
Automation of the Big Data Analysis (BDA) procedure offers great benefits in the era of Big Data and Artificial Intelligence. The BDA procedure can be efficiently automated using the automatic service composition concept. Our previous work on Auto-BDA shows great promise in reducing turnaround time for data analysis. Moreover, such automation requires a well-geared combination of data preparation and optimal model (deep learning) generation. This paper shows the construction of automated BDA and model generation (here, deep learning) together with data preparation and parameter optimization.
{"title":"Automating Big Data Analysis Based on Deep Learning Generation by Automatic Service Composition","authors":"Incheon Paik, T. Siriweera","doi":"10.1109/DSAA.2019.00081","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00081","url":null,"abstract":"Automation of Big Data Analysis (BDA) procedure gives us a great profit in the era of Big Data and Artificial Intelligence. BDA procedure can be efficiently automated by the automatic service composition concept efficiently. Our previous work for Auto-BDA shows a great future prospect in reducing turnaround time for data analysis. Moreover, it requires consideration of the automation with a well-geared combination of the data preparation and the optimal model (deep learning) generation. This paper shows the construction of automating BDA and model generation (here deep learning) together with data preparation and parameter optimization.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114483648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tutorials
Emre Kıcıman
Provides an abstract for each of the tutorial presentations and may include a brief professional biography of each presenter. The complete presentations were not made available for publication as part of the conference proceedings.
{"title":"Tutorials","authors":"Emre Kıcıman","doi":"10.1109/dsaa.2019.00013","DOIUrl":"https://doi.org/10.1109/dsaa.2019.00013","url":null,"abstract":"Provides an abstract for each of the tutorial presentations and may include a brief professional biography of each presenter. The complete presentations were not made available for publication as part of the conference proceedings.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124190725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variable-Lag Granger Causality for Time Series Analysis
Chainarong Amornbunchornvej, E. Zheleva, T. Berger-Wolf
Granger causality is a fundamental technique for causal inference in time series data, commonly used in the social and biological sciences. Typical operationalizations of Granger causality make a strong assumption that every time point of the effect time series is influenced by a combination of other time series with a fixed time delay. However, the assumption of the fixed time delay does not hold in many applications, such as collective behavior, financial markets, and many natural phenomena. To address this issue, we develop variable-lag Granger causality, a generalization of Granger causality that relaxes the assumption of the fixed time delay and allows causes to influence effects with arbitrary time delays. In addition, we propose a method for inferring variable-lag Granger causality relations. We demonstrate our approach on an application for studying coordinated collective behavior and show that it performs better than several existing methods in both simulated and real-world datasets. Our approach can be applied in any domain of time series analysis.
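To make the fixed-lag assumption concrete, the hedged sketch below (not the authors' variable-lag algorithm; the simulated series, coefficients, and lag schedule are illustrative) builds a pair of series whose causal delay drifts over time and then runs a standard fixed-lag Granger test on them:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 600
x = rng.normal(size=n)
y = np.zeros(n)
# Y follows X with a delay that drifts between 1 and 3 steps over time,
# violating the fixed-lag assumption of standard Granger causality.
for t in range(3, n):
    lag = 1 + (t // 200) % 3
    y[t] = 0.8 * x[t - lag] + 0.2 * rng.normal()

# The test checks whether the 2nd column Granger-causes the 1st column.
data = np.column_stack([y, x])
results = grangercausalitytests(data, maxlag=3)
```

A test run at a single fixed lag can understate the relationship here; the variable-lag formulation described above instead allows the inferred delay to vary across time points.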
{"title":"Variable-Lag Granger Causality for Time Series Analysis","authors":"Chainarong Amornbunchornvej, E. Zheleva, T. Berger-Wolf","doi":"10.1109/DSAA.2019.00016","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00016","url":null,"abstract":"Granger causality is a fundamental technique for causal inference in time series data, commonly used in the social and biological sciences. Typical operationalizations of Granger causality make a strong assumption that every time point of the effect time series is influenced by a combination of other time series with a fixed time delay. However, the assumption of the fixed time delay does not hold in many applications, such as collective behavior, financial markets, and many natural phenomena. To address this issue, we develop variable-lag Granger causality, a generalization of Granger causality that relaxes the assumption of the fixed time delay and allows causes to influence effects with arbitrary time delays. In addition, we propose a method for inferring variable-lag Granger causality relations. We demonstrate our approach on an application for studying coordinated collective behavior and show that it performs better than several existing methods in both simulated and real-world datasets. Our approach can be applied in any domain of time series analysis.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124391653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Classification Consistency of High-Dimensional Sparse Neural Network
Kaixu Yang, T. Maiti
Artificial neural networks (ANNs) are an automatic way of capturing linear and nonlinear correlations, spatial and other structural dependence among features and output variables. This results in good performance in many application areas such as classification and prediction from magnetic resonance imaging, spatial data, and computer vision tasks. Most commonly used ANNs assume the availability of large training data compared to the dimension of the feature vector. However, in modern applications, such as MRI or computer vision, the training sample sizes are often low, and may even be lower than the dimension of the feature vector. In this paper, we consider a single-layer ANN classification model that is suitable for small training samples. Besides developing the sparse architecture, we also study theoretical properties of our classifier. We show that, under mild conditions, the classification risk converges to the optimal Bayes classifier risk (universal consistency) under sparse group lasso regularization. Moreover, we propose a variation on the regularization terms. A few examples in popular research fields are also provided to illustrate the theory and methods.
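A minimal sketch of the kind of regularization discussed above (assumptions: PyTorch, a toy dataset, first-layer weights grouped by input feature; this is not the authors' estimator or the setting of their consistency proof): a single-hidden-layer classifier trained with a sparse group lasso penalty, so that entire irrelevant input features can be driven to zero even when the feature dimension exceeds the sample size.

```python
import torch
import torch.nn as nn

p, h, n = 200, 16, 50                       # feature dimension larger than sample size
X = torch.randn(n, p)
y = (X[:, 0] - X[:, 1] > 0).long()          # only two features actually matter

model = nn.Sequential(nn.Linear(p, h), nn.ReLU(), nn.Linear(h, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
lam_group, lam_l1 = 1e-2, 1e-3              # illustrative penalty weights

for epoch in range(200):
    opt.zero_grad()
    logits = model(X)
    W1 = model[0].weight                    # shape (h, p): column j belongs to feature j
    group_pen = W1.norm(dim=0).sum()        # group lasso over input-feature columns
    l1_pen = W1.abs().sum()                 # within-group (elementwise) sparsity
    loss = loss_fn(logits, y) + lam_group * group_pen + lam_l1 * l1_pen
    loss.backward()
    opt.step()
```

Grouping the first-layer weights by input feature is what lets the penalty remove whole features rather than individual weights, which is the mechanism behind feature-level sparsity in the high-dimensional, low-sample regime described above.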
{"title":"On the Classification Consistency of High-Dimensional Sparse Neural Network","authors":"Kaixu Yang, T. Maiti","doi":"10.1109/DSAA.2019.00032","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00032","url":null,"abstract":"Artificial neural networks (ANN) is an automatic way of capturing linear and nonlinear correlations, spatial and other structural dependence among features and output variables. This results good performance in many application areas such as classification and prediction from magnetic resonance imaging, spatial data and computer vision tasks. Most commonly used ANNs assume availability of large training data compared to dimension of feature vector. However, in modern days applications, such as MRI applications or in computer visions the training sample sizes are often low, and may be even lower than the dimension of feature vector. In this paper, we consider a single layer ANN classification model that is suitable for low training sample. Besides developing the sparse architecture, we also studied theoretical properties of our machine. We showed that under mild conditions, the classification risk converges to the optimal Bayes classifier risk (universal consistency) under sparse group lasso regularization. Moreover, we proposed a variation on the regularization terms. A few examples in popular research fields are also provided to illustrate the theory and methods.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125047892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery
Adnene Belfodil, Aimene Belfodil, Anes Bendimerad, Philippe Lamarre, C. Robardet, Mehdi Kaytoue-Uberall, M. Plantevit
Subgroup discovery (SD) is the task of discovering interpretable patterns in the data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate a class from the others is one of the most common SD tasks. Standard approaches in the literature are based on local pattern discovery, which is known to provide an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and well-discriminant with respect to the target class. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery where both discriminative power and diversity of the subgroup set are incorporated in the same quality measure. We propose an efficient and parameter-free algorithm dubbed FSSD, based on a greedy scheme. FSSD uses several optimization strategies that enable it to efficiently provide a high-quality pattern set in a short amount of time.
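The greedy scheme can be illustrated with a deliberately simplified sketch (the candidate subgroups, record sets, and gain function below are toy placeholders; the actual FSSD quality measure combines discrimination and diversity differently and searches the pattern space rather than a fixed candidate list): at each step, add the subgroup with the largest marginal contribution to the set already chosen.

```python
def greedy_subgroup_set(candidates, positives, negatives, k=3):
    """candidates: dict mapping a subgroup description to the set of record ids it covers."""
    chosen, covered = [], set()
    for _ in range(k):
        def gain(name):
            # Only records not yet covered by the set contribute, which rewards diversity.
            new = candidates[name] - covered
            return len(new & positives) - len(new & negatives)
        best = max(candidates, key=gain)
        if gain(best) <= 0:          # stop when no subgroup improves the set
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen

positives = {1, 2, 3, 4}
negatives = {5, 6, 7}
candidates = {
    "age>=60": {1, 2, 5},
    "smoker=yes": {3, 4},
    "city=Lyon": {1, 5, 6, 7},
}
print(greedy_subgroup_set(candidates, positives, negatives))  # ['smoker=yes', 'age>=60']
```

Evaluating each candidate by its marginal gain over the current set, rather than in isolation, is what suppresses the redundant, near-duplicate patterns that local top-k subgroup discovery tends to return.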
{"title":"FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery","authors":"Adnene Belfodil, Aimene Belfodil, Anes Bendimerad, Philippe Lamarre, C. Robardet, Mehdi Kaytoue-Uberall, M. Plantevit","doi":"10.1109/DSAA.2019.00023","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00023","url":null,"abstract":"Subgroup discovery (SD) is the task of discovering interpretable patterns in the data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate a class from the others is one of the most common SD tasks. Standard approaches of the literature are based on local pattern discovery, which is known to provide an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and well-discriminant to~the target class. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery where both discriminative power and diversity of the subgroup set are incorporated in the same quality measure. We propose an efficient and parameter-free algorithm dubbed FSSD and based on a greedy scheme. FSSD uses several optimization strategies that enable to efficiently provide a high quality pattern set in a short amount of time.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130760050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Residual Networks Behave Like Boosting Algorithms
Chapman Siu
We show that Residual Networks (ResNet) are equivalent to boosting feature representation, without any modification to the underlying ResNet training algorithm. A regret bound based on Online Gradient Boosting theory is proved and suggests that ResNet could achieve Online Gradient Boosting regret bounds through neural network architectural changes, with the addition of a shrinkage parameter in the identity skip-connections and the use of residual modules with max-norm bounds. Through this relation between ResNet and Online Boosting, novel feature representation boosting algorithms can be constructed by altering residual modules. We demonstrate this by proposing decision tree residual modules to construct a new boosted decision tree algorithm, demonstrating generalization error bounds for both approaches, and relaxing constraints within the BoostResNet algorithm to allow it to be trained in an out-of-core manner. We evaluate convolutional ResNet with and without shrinkage modifications to demonstrate its efficacy, and demonstrate that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawbacks of offline approaches.
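A sketch of the architectural tweak described above (assumptions: PyTorch, an arbitrary two-convolution residual body; this illustrates the shrinkage idea, not the paper's full construction or its max-norm constraints): a residual block whose identity skip-connection is combined with a shrinkage factor gamma on the residual branch, mirroring the learning-rate-like shrinkage used in gradient boosting.

```python
import torch
import torch.nn as nn

class ShrinkageResidualBlock(nn.Module):
    def __init__(self, channels: int, gamma: float = 0.1):
        super().__init__()
        self.gamma = gamma                      # boosting-style shrinkage on the residual branch
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Identity skip plus a shrunken residual module: y = x + gamma * F(x).
        return torch.relu(x + self.gamma * self.body(x))

block = ShrinkageResidualBlock(channels=8, gamma=0.1)
print(block(torch.randn(2, 8, 16, 16)).shape)   # torch.Size([2, 8, 16, 16])
```

Under the boosting view, each residual module plays the role of a weak learner added to the running representation, and gamma damps each addition in the same way a shrinkage (learning-rate) parameter damps each boosting round.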
{"title":"Residual Networks Behave Like Boosting Algorithms","authors":"Chapman Siu","doi":"10.1109/DSAA.2019.00017","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00017","url":null,"abstract":"We show that Residual Networks (ResNet) is equivalent to boosting feature representation, without any modification to the underlying ResNet training algorithm. A regret bound based on Online Gradient Boosting theory is proved and suggests that ResNet could achieve Online Gradient Boosting regret bounds through neural network architectural changes with the addition of a shrinkage parameter in the identity skip-connections and using residual modules with max-norm bounds. Through this relation between ResNet and Online Boosting, novel feature representation boosting algorithms can be constructed based on altering residual modules. We demonstrate this through proposing decision tree residual modules to construct a new boosted decision tree algorithm and demonstrating generalization error bounds for both approaches; relaxing constraints within BoostResNet algorithm to allow it to be trained in an out-of-core manner. We evaluate convolution ResNet with and without shrinkage modifications to demonstrate its efficacy, and demonstrate that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawback of offline approaches.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130861679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sign Language Recognition Analysis using Multimodal Data
Al Amin Hosain, P. Santhalingam, P. Pathak, J. Kosecka, H. Rangwala
Voice-controlled personal and home assistants (such as the Amazon Echo and Apple Siri) are becoming increasingly popular for a variety of applications. However, the benefits of these technologies are not readily accessible to Deaf or Hard-of-Hearing (DHH) users. The objective of this study is to develop and evaluate a sign recognition system using multiple modalities that can be used by DHH signers to interact with voice-controlled devices. With the advancement of depth sensors, skeletal data is used for applications like video analysis and activity recognition. Despite its similarity to the well-studied problem of human activity recognition, the use of 3D skeleton data in sign language recognition is rare. This is because, unlike activity recognition, sign language depends mostly on hand shape patterns. In this work, we investigate the feasibility of using skeletal and RGB video data for sign language recognition using a combination of different deep learning architectures. We validate our results on GMUASL51, a large-scale American Sign Language (ASL) dataset of 12 users and 13,107 samples across 51 signs. We collected the dataset over 6 months, and it will be publicly released in the hope of spurring further machine learning research towards providing improved accessibility for digital assistants.
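A minimal sketch of one modality discussed above (assumptions: PyTorch, 25 joints with (x, y, z) coordinates per frame, a single recurrent layer; this is not the paper's architecture or the GMUASL51 loading code): classifying a sign from a sequence of 3D skeleton frames with a recurrent network over the flattened joint coordinates.

```python
import torch
import torch.nn as nn

class SkeletonSignClassifier(nn.Module):
    def __init__(self, num_joints: int = 25, num_classes: int = 51, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3,   # (x, y, z) per joint
                            hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, skeletons):                # (batch, frames, joints, 3)
        b, t, j, c = skeletons.shape
        seq = skeletons.reshape(b, t, j * c)     # flatten joints per frame
        _, (h_n, _) = self.lstm(seq)
        return self.head(h_n[-1])                # logits over the 51 signs

model = SkeletonSignClassifier()
print(model(torch.randn(4, 60, 25, 3)).shape)    # torch.Size([4, 51])
```

In a multimodal setup of the kind described above, a skeleton branch like this would typically be combined with an RGB video branch, whose features capture the hand-shape detail that coarse joint positions miss.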
{"title":"Sign Language Recognition Analysis using Multimodal Data","authors":"Al Amin Hosain, P. Santhalingam, P. Pathak, J. Kosecka, H. Rangwala","doi":"10.1109/DSAA.2019.00035","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00035","url":null,"abstract":"Voice-controlled personal and home assistants (such as the Amazon Echo and Apple Siri) are becoming increasingly popular for a variety of applications. However, the benefits of these technologies are not readily accessible to Deaf or Hard-ofHearing (DHH) users. The objective of this study is to develop and evaluate a sign recognition system using multiple modalities that can be used by DHH signers to interact with voice-controlled devices. With the advancement of depth sensors, skeletal data is used for applications like video analysis and activity recognition. Despite having similarity with the well-studied human activity recognition, the use of 3D skeleton data in sign language recognition is rare. This is because unlike activity recognition, sign language is mostly dependent on hand shape pattern. In this work, we investigate the feasibility of using skeletal and RGB video data for sign language recognition using a combination of different deep learning architectures. We validate our results on a large-scale American Sign Language (ASL) dataset of 12 users and 13107 samples across 51 signs. It is named as GMUASL51. 1 We collected the dataset over 6 months and it will be publicly released in the hope of spurring further machine learning research towards providing improved accessibility for digital assistants.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"178 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114096121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SliceNDice: Mining Suspicious Multi-Attribute Entity Groups with Multi-View Graphs
H. Nilforoshan, Neil Shah
Given the reach of web platforms, bad actors have considerable incentives to manipulate and defraud users at the expense of platform integrity. This has spurred research in numerous suspicious behavior detection tasks, including detection of sybil accounts, false information, and payment scams/fraud. In this paper, we draw the insight that many such initiatives can be tackled in a common framework by posing a detection task which seeks to find groups of entities which share too many properties with one another across multiple attributes (sybil accounts created at the same time and location, propaganda spreaders broadcasting articles with the same rhetoric and with similar reshares, etc.). Our work makes four core contributions: Firstly, we posit a novel formulation of this task as a multi-view graph mining problem, in which distinct views reflect distinct attribute similarities across entities, and contextual similarity and attribute importance are respected. Secondly, we propose a novel suspiciousness metric for scoring entity groups given the abnormality of their synchronicity across multiple views, which obeys intuitive desiderata that existing metrics do not. Thirdly, we propose the SliceNDice algorithm, which enables efficient extraction of highly suspicious entity groups. Finally, we demonstrate its practicality in production, in terms of strong detection performance and discoveries on Snapchat's large advertiser ecosystem (89% precision and numerous discoveries of real fraud rings), marked outperformance of baselines (over 97% precision/recall in simulated settings), and linear scalability.
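A toy illustration of the core intuition only (this is not the paper's suspiciousness metric, and the entities, attributes, and scoring are invented for the example): a group of accounts looks more suspicious the more attribute values its members share, weighted by how rare those values are in the overall population.

```python
import math
from collections import Counter

def shared_value_surprise(group, population, attributes):
    """Sum of -log(background frequency) over attribute values shared by the whole group."""
    score = 0.0
    for attr in attributes:
        values = {entity[attr] for entity in group}
        if len(values) == 1:                                  # every member shares this value
            shared = values.pop()
            freq = Counter(e[attr] for e in population)[shared] / len(population)
            score += -math.log(freq)                          # rarer shared values are more surprising
    return score

population = [
    {"signup_ip": "1.2.3.4", "device": "android"},
    {"signup_ip": "1.2.3.4", "device": "android"},
    {"signup_ip": "5.6.7.8", "device": "ios"},
    {"signup_ip": "9.9.9.9", "device": "android"},
]
ring = population[:2]
print(shared_value_surprise(ring, population, ["signup_ip", "device"]))
```

The paper's metric goes considerably further, scoring synchronicity across multiple graph views while respecting attribute importance, but the underlying question is the same one posed here: how unlikely is this much overlap by chance?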
{"title":"SliceNDice: Mining Suspicious Multi-Attribute Entity Groups with Multi-View Graphs","authors":"H. Nilforoshan, Neil Shah","doi":"10.1109/dsaa.2019.00050","DOIUrl":"https://doi.org/10.1109/dsaa.2019.00050","url":null,"abstract":"Given the reach of web platforms, bad actors have considerable incentives to manipulate and defraud users at the expense of platform integrity. This has spurred research in numerous suspicious behavior detection tasks, including detection of sybil accounts, false information, and payment scams/fraud. In this paper, we draw the insight that many such initiatives can be tackled in a common framework by posing a detection task which seeks to find groups of entities which share too many properties with one another across multiple attributes (sybil accounts created at the same time and location, propaganda spreaders broadcasting articles with the same rhetoric and with similar reshares, etc.) Our work makes four core contributions: Firstly, we posit a novel formulation of this task as a multi-view graph mining problem, in which distinct views reflect distinct attribute similarities across entities, and contextual similarity and attribute importance are respected. Secondly, we propose a novel suspiciousness metric for scoring entity groups given the abnormality of their synchronicity across multiple views, which obeys intuitive desiderata that existing metrics do not. Finally, we propose the SliceNDice algorithm which enables efficient extraction of highly suspicious entity groups, and demonstrate its practicality in production, in terms of strong detection performance and discoveries on Snapchat's large advertiser ecosystem (89% precision and numerous discoveries of real fraud rings), marked outperformance of baselines (over 97% precision/recall in simulated settings) and linear scalability.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126284338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform
Zhenyu Zhao, Radhika Anand, Mallory Wang
In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is an essential method in such applications, serving multiple objectives: improving prediction accuracy by eliminating irrelevant features, accelerating model training and prediction, reducing the monitoring and maintenance workload for the feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is considered an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework addresses this problem by selecting relevant features while controlling for redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problems in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.
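A hedged sketch of the classic greedy mRMR criterion referenced above (the paper extends this with a non-linear redundancy measure and a model-based relevance measure; the sketch below shows only the standard mutual-information formulation on a synthetic dataset): at each step, pick the feature whose relevance to the label, minus its mean redundancy with the features already selected, is largest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
relevance = mutual_info_classif(X, y, random_state=0)     # MI between each feature and the label

selected = [int(np.argmax(relevance))]                    # start with the most relevant feature
while len(selected) < 5:
    best, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        # Mean MI between candidate j and the already-selected features (redundancy).
        redundancy = mutual_info_regression(X[:, selected], X[:, j], random_state=0).mean()
        score = relevance[j] - redundancy                 # mRMR criterion (difference form)
        if score > best_score:
            best, best_score = j, score
    selected.append(best)

print("selected features:", selected)
```

Subtracting the redundancy term is what distinguishes mRMR from plain relevance ranking: two highly predictive but nearly duplicate features will not both be chosen early.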
{"title":"Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform","authors":"Zhenyu Zhao, Radhika Anand, Mallory Wang","doi":"10.1109/dsaa.2019.00059","DOIUrl":"https://doi.org/10.1109/dsaa.2019.00059","url":null,"abstract":"In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is one essential method in such applications for multiple objectives: improving the prediction accuracy by eliminating irrelevant features, accelerating the model training and prediction speed, reducing the monitoring and maintenance workload for feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is considered as an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework solves this problem by selecting the relevant features while controlling for the redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problem in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115357387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}