
Latest publications from the 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

Detecting Sensitive Content in Spoken Language
Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi
Spoken language can include sensitive topics such as profanity, insults, and political or offensive speech. In order to engage in contextually appropriate conversations, it is essential for voice services such as Alexa, Google Assistant, and Siri to detect sensitive topics in conversations and react appropriately. A simple approach to detecting sensitive topics is to use regular-expression or keyword-based rules. However, keyword-based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale or generalize well, even for minor variations of the keywords. Machine learning (ML) approaches offer the potential benefit of generalization but require large volumes of training data, which are difficult to obtain for sparse-data problems. This paper describes: (1) an ML-based solution that uses training data (a 2.1M-example dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>90%). Moreover, despite relying on synthetic training data, the ML models generalize beyond the training data, identifying significantly more of the test stream as sensitive (~2x for Logistic Regression, and ~4x-6x for neural network models such as Bi-LSTM and CNN) in comparison to a baseline approach that uses the training data (~1 million examples) as rules. We are able to train our models with very few manual annotations. The percentage shares of sensitive examples in our training dataset from synthetic generation using templates and from manual annotations are 98.04% and 1.96%, respectively. The percentage shares of non-sensitive examples in our training dataset from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations are 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also have a lower memory footprint (22.5% lower than the baseline and 80% lower than Logistic Regression) while giving improved accuracy.
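The recall drawback of keyword rules that the abstract describes can be illustrated with a toy baseline (not from the paper; the keyword list and function name are hypothetical): an exact word-boundary match misses even a minor spelling variation.

```python
import re

# Hypothetical keyword list; the paper's actual rules are not public.
SENSITIVE_KEYWORDS = {"insult", "profanity", "politics"}
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(SENSITIVE_KEYWORDS))) + r")\b",
    re.IGNORECASE,
)

def rule_based_sensitive(utterance: str) -> bool:
    """Flag an utterance only on an exact word-boundary keyword match."""
    return bool(PATTERN.search(utterance))
```

A misspelled variant such as "poliitics" slips through this rule, which is exactly the generalization gap the ML models are meant to close.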
DOI: 10.1109/DSAA.2019.00052
Citations: 5
Automating Big Data Analysis Based on Deep Learning Generation by Automatic Service Composition
Incheon Paik, T. Siriweera
Automation of the Big Data Analysis (BDA) procedure offers great benefits in the era of Big Data and Artificial Intelligence. The BDA procedure can be automated efficiently using the automatic service composition concept. Our previous work on Auto-BDA shows great promise in reducing turnaround time for data analysis. Moreover, it requires that automation be considered together with a well-geared combination of data preparation and optimal model (deep learning) generation. This paper presents the construction of automated BDA and model generation (here, deep learning), together with data preparation and parameter optimization.
DOI: 10.1109/DSAA.2019.00081
Citations: 3
Tutorials
Emre Kıcıman
Provides an abstract for each of the tutorial presentations and may include a brief professional biography of each presenter. The complete presentations were not made available for publication as part of the conference proceedings.
DOI: 10.1109/dsaa.2019.00013
Citations: 0
Variable-Lag Granger Causality for Time Series Analysis
Chainarong Amornbunchornvej, E. Zheleva, T. Berger-Wolf
Granger causality is a fundamental technique for causal inference in time series data, commonly used in the social and biological sciences. Typical operationalizations of Granger causality make a strong assumption that every time point of the effect time series is influenced by a combination of other time series with a fixed time delay. However, the assumption of the fixed time delay does not hold in many applications, such as collective behavior, financial markets, and many natural phenomena. To address this issue, we develop variable-lag Granger causality, a generalization of Granger causality that relaxes the assumption of the fixed time delay and allows causes to influence effects with arbitrary time delays. In addition, we propose a method for inferring variable-lag Granger causality relations. We demonstrate our approach on an application for studying coordinated collective behavior and show that it performs better than several existing methods in both simulated and real-world datasets. Our approach can be applied in any domain of time series analysis.
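As context for the fixed-lag assumption the paper relaxes, the classical fixed-lag test can be sketched in a few lines (a generic illustration, not the authors' variable-lag method): regress the effect series on its own lags with and without the candidate cause's lags, and compare residual sums of squares.

```python
import numpy as np

def granger_rss(y, x, lag):
    """Residual sums of squares for the restricted (own lags only) and
    unrestricted (own lags plus x lags) regressions of a fixed-lag
    Granger causality test."""
    rows = range(lag, len(y))
    Y = np.array([y[t] for t in rows])
    y_lags = np.array([[y[t - k] for k in range(1, lag + 1)] for t in rows])
    x_lags = np.array([[x[t - k] for k in range(1, lag + 1)] for t in rows])

    def rss(predictors):
        design = np.column_stack([np.ones(len(Y)), predictors])
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        resid = Y - design @ beta
        return float(resid @ resid)

    # If x Granger-causes y at this lag, adding x's lags shrinks the RSS.
    return rss(y_lags), rss(np.hstack([y_lags, x_lags]))
```

Note that this test only detects influence at the fixed delays supplied; a cause acting with a varying delay is precisely what it can miss, motivating the variable-lag generalization.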
DOI: 10.1109/DSAA.2019.00016
Citations: 17
On the Classification Consistency of High-Dimensional Sparse Neural Network
Kaixu Yang, T. Maiti
Artificial neural networks (ANNs) are an automatic way of capturing linear and nonlinear correlations, as well as spatial and other structural dependence, among features and output variables. This yields good performance in many application areas, such as classification and prediction from magnetic resonance imaging, spatial data, and computer vision tasks. Most commonly used ANNs assume the availability of training data that is large compared to the dimension of the feature vector. However, in modern applications such as MRI or computer vision, training sample sizes are often low, and may even be lower than the dimension of the feature vector. In this paper, we consider a single-layer ANN classification model that is suitable for small training samples. Besides developing the sparse architecture, we also study the theoretical properties of our model. We show that, under mild conditions, the classification risk converges to the optimal Bayes classifier risk (universal consistency) under sparse group lasso regularization. Moreover, we propose a variation on the regularization terms. A few examples from popular research fields are also provided to illustrate the theory and methods.
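The sparse group lasso penalty underlying the consistency result can be written out directly; this is a sketch of the standard penalty (group structure and parameter names are illustrative, not taken from the paper).

```python
import math

def sparse_group_lasso_penalty(weights, groups, lam_l1, lam_group):
    """Sparse group lasso penalty: an l1 term that can zero individual weights
    plus a group-l2 term (scaled by sqrt of group size) that can zero whole
    groups, e.g. all first-layer weights attached to one input feature."""
    l1 = sum(abs(w) for w in weights)
    group_l2 = 0.0
    for idx in groups:  # each group is a list of indices into `weights`
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        group_l2 += math.sqrt(len(idx)) * norm
    return lam_l1 * l1 + lam_group * group_l2
```

Zeroing a whole group prunes an input feature entirely, which is what makes the architecture sparse when samples are few.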
DOI: 10.1109/DSAA.2019.00032
Citations: 0
FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery
Adnene Belfodil, Aimene Belfodil, Anes Bendimerad, Philippe Lamarre, C. Robardet, Mehdi Kaytoue-Uberall, M. Plantevit
Subgroup discovery (SD) is the task of discovering interpretable patterns in data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate one class from the others is one of the most common SD tasks. Standard approaches in the literature are based on local pattern discovery, which is known to produce an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and well-discriminant for the target class. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery in which both the discriminative power and the diversity of the subgroup set are incorporated into the same quality measure. We propose an efficient, parameter-free algorithm dubbed FSSD, based on a greedy scheme. FSSD uses several optimization strategies that enable it to provide a high-quality pattern set in a short amount of time.
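The "quality of a pattern set as a whole" idea behind the greedy scheme can be sketched with a toy coverage-based objective (a generic illustration, not the FSSD quality measure itself): marginal gain automatically discounts patterns redundant with those already selected.

```python
def greedy_pattern_set(patterns, k):
    """Greedy pattern-set selection: each candidate pattern is the set of
    positive instances it covers; set quality is the size of the union, so a
    pattern redundant with already-chosen ones contributes little gain."""
    selected, covered = [], set()
    for _ in range(min(k, len(patterns))):
        best = max(patterns, key=lambda name: len(patterns[name] - covered))
        if not patterns[best] - covered:
            break  # no remaining pattern adds new coverage
        selected.append(best)
        covered |= patterns[best]
    return selected, covered
```

Here a pattern covering only instances already covered ("p2" below) is never chosen, keeping the set small and diverse.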
DOI: 10.1109/DSAA.2019.00023
Citations: 16
Residual Networks Behave Like Boosting Algorithms
Chapman Siu
We show that Residual Networks (ResNets) are equivalent to boosting feature representation, without any modification to the underlying ResNet training algorithm. A regret bound based on Online Gradient Boosting theory is proved, suggesting that ResNets can achieve Online Gradient Boosting regret bounds through architectural changes: adding a shrinkage parameter to the identity skip-connections and using residual modules with max-norm bounds. Through this relation between ResNets and Online Boosting, novel feature-representation boosting algorithms can be constructed by altering the residual modules. We demonstrate this by proposing decision-tree residual modules to construct a new boosted decision tree algorithm and demonstrating generalization error bounds for both approaches, and by relaxing constraints within the BoostResNet algorithm to allow it to be trained in an out-of-core manner. We evaluate convolutional ResNets with and without the shrinkage modification to demonstrate its efficacy, and show that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawbacks of offline approaches.
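The architectural change the bound suggests, a shrinkage parameter on the identity skip-connection, can be sketched in a few lines (a generic illustration, not the authors' implementation):

```python
import numpy as np

def residual_block(x, f, shrinkage=1.0):
    """Identity skip-connection with a shrinkage parameter on the residual
    branch: output = x + shrinkage * f(x)."""
    return x + shrinkage * f(x)

def forward(x, blocks, shrinkage):
    # Stacking blocks makes the network output an additive sum of shrunk
    # residual-module contributions, mirroring shrunk weak learners in
    # gradient boosting.
    for f in blocks:
        x = residual_block(x, f, shrinkage)
    return x
```

With shrinkage below 1, each residual module contributes a damped correction, which is the role a learning-rate-like shrinkage plays in boosting.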
DOI: 10.1109/DSAA.2019.00017
Citations: 6
Sign Language Recognition Analysis using Multimodal Data
Al Amin Hosain, P. Santhalingam, P. Pathak, J. Kosecka, H. Rangwala
Voice-controlled personal and home assistants (such as the Amazon Echo and Apple Siri) are becoming increasingly popular for a variety of applications. However, the benefits of these technologies are not readily accessible to Deaf or Hard-of-Hearing (DHH) users. The objective of this study is to develop and evaluate a sign recognition system using multiple modalities that DHH signers can use to interact with voice-controlled devices. With the advancement of depth sensors, skeletal data is used for applications such as video analysis and activity recognition. Despite its similarity to the well-studied problem of human activity recognition, the use of 3D skeleton data in sign language recognition is rare. This is because, unlike activity recognition, sign language depends mostly on hand shape patterns. In this work, we investigate the feasibility of using skeletal and RGB video data for sign language recognition with a combination of different deep learning architectures. We validate our results on a large-scale American Sign Language (ASL) dataset of 12 users and 13,107 samples across 51 signs, named GMUASL51. We collected the dataset over 6 months, and it will be publicly released in the hope of spurring further machine learning research towards improved accessibility for digital assistants.
DOI: 10.1109/DSAA.2019.00035
Citations: 17
SliceNDice: Mining Suspicious Multi-Attribute Entity Groups with Multi-View Graphs
H. Nilforoshan, Neil Shah
Given the reach of web platforms, bad actors have considerable incentives to manipulate and defraud users at the expense of platform integrity. This has spurred research in numerous suspicious behavior detection tasks, including detection of sybil accounts, false information, and payment scams/fraud. In this paper, we draw the insight that many such initiatives can be tackled in a common framework by posing a detection task which seeks to find groups of entities which share too many properties with one another across multiple attributes (sybil accounts created at the same time and location, propaganda spreaders broadcasting articles with the same rhetoric and with similar reshares, etc.) Our work makes four core contributions: Firstly, we posit a novel formulation of this task as a multi-view graph mining problem, in which distinct views reflect distinct attribute similarities across entities, and contextual similarity and attribute importance are respected. Secondly, we propose a novel suspiciousness metric for scoring entity groups given the abnormality of their synchronicity across multiple views, which obeys intuitive desiderata that existing metrics do not. Finally, we propose the SliceNDice algorithm which enables efficient extraction of highly suspicious entity groups, and demonstrate its practicality in production, in terms of strong detection performance and discoveries on Snapchat's large advertiser ecosystem (89% precision and numerous discoveries of real fraud rings), marked outperformance of baselines (over 97% precision/recall in simulated settings) and linear scalability.
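The notion of a group "sharing too many properties across multiple attributes" can be sketched with a toy synchronicity count (a simplification; SliceNDice's actual suspiciousness metric also weights each view by how abnormal the agreement is relative to the background population).

```python
def group_synchronicity(entities, group, views):
    """Toy multi-view score: the number of attribute views on which every
    entity in the group shares one value (e.g. same signup time, same city)."""
    score = 0
    for view in views:
        values = {entities[e][view] for e in group}
        if len(values) == 1:  # the whole group agrees on this view
            score += 1
    return score
```

A group of sybil accounts created at the same time and place would score high across both views, while an organic group typically agrees on few.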
DOI: 10.1109/dsaa.2019.00050
Citations: 19
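To make the "synchronicity across multiple views" idea in the abstract above concrete, here is a minimal toy sketch. It is NOT the paper's actual suspiciousness metric: each view is assumed to be an entity-by-entity similarity matrix for one attribute, and a candidate group is scored by how far its mean within-group similarity exceeds the view-wide background mean, summed across views.

```python
import numpy as np

def group_suspiciousness(views, group):
    """Toy score illustrating cross-view synchronicity.

    views : list of n x n similarity matrices, one per attribute view.
    group : indices of the candidate entity group (at least 2 members).

    For each view, compare the group's mean pairwise (off-diagonal)
    similarity to the view-wide background mean and accumulate the excess.
    This is an illustrative sketch only, not SliceNDice's metric.
    """
    group = np.asarray(group)
    score = 0.0
    for sim in views:
        n = sim.shape[0]
        # Background: mean similarity over all off-diagonal entity pairs.
        background = sim[~np.eye(n, dtype=bool)].mean()
        # Group: mean similarity over off-diagonal pairs within the group.
        sub = sim[np.ix_(group, group)]
        k = len(group)
        group_mean = sub[~np.eye(k, dtype=bool)].mean()
        score += max(0.0, group_mean - background)
    return score
```

In this toy version, a group whose members are unusually similar to one another in several attribute views accumulates a large score, while a group at background similarity contributes nothing.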
Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform
Zhenyu Zhao, Radhika Anand, Mallory Wang
In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is an essential method in such applications, serving multiple objectives: improving prediction accuracy by eliminating irrelevant features, accelerating model training and prediction, reducing the monitoring and maintenance workload for the feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework addresses this problem by selecting relevant features while controlling for redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problems in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.
Zhenyu Zhao, Radhika Anand, Mallory Wang. "Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform." DOI: 10.1109/dsaa.2019.00059. Published 2019-08-15, 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA).
Citations: 89
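The mRMR framework described in the abstract above can be sketched as a simple greedy loop: pick the most relevant feature first, then repeatedly add the feature with the best relevance-minus-redundancy score. The sketch below uses absolute Pearson correlation for both relevance and redundancy; this is one common mRMR variant, not the extended non-linear and model-based measures the paper introduces.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR sketch using absolute Pearson correlation.

    X : (n_samples, n_features) feature matrix.
    y : (n_samples,) target vector.
    k : number of features to select.

    Relevance of a feature is |corr(feature, y)|; redundancy of a
    candidate is its mean |corr| with already-selected features.
    """
    n_features = X.shape[1]
    # Relevance: absolute correlation of each feature with the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Redundancy source: absolute feature-feature correlation matrix.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = corr[j, selected].mean()
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

Under this scoring, an exact duplicate of an already-selected feature is heavily penalized (redundancy near 1), so the greedy loop prefers a less relevant but non-redundant feature instead, which is the behavior the mRMR objective is designed to produce.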