Detecting Sensitive Content in Spoken Language
Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi
2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), October 2019
DOI: 10.1109/DSAA.2019.00052
Citations: 5
Abstract
Spoken language can include sensitive topics such as profanity, insults, and political or offensive speech. To engage in contextually appropriate conversations, voice services such as Alexa, Google Assistant, and Siri must detect sensitive topics in conversations and react appropriately. A simple approach to detecting sensitive topics is to use regular-expression or keyword-based rules. However, keyword-based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) the rules do not scale or generalize well, even for minor variations of the keywords. Machine learning (ML) approaches offer the potential benefit of generalization but require large volumes of training data, which is difficult to obtain for sparse-data problems. This paper describes: (1) an ML-based solution that uses training data (a 2.1M-example dataset), obtained through synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>90%). Moreover, despite relying on synthetic training data, the ML models generalize beyond the training data, identifying significantly more of the test stream as sensitive (~2x for Logistic Regression, and ~4x-6x for neural network models such as Bi-LSTM and CNN) than a baseline approach that uses the training data (~1 million examples) as rules. We are able to train our models with very few manual annotations. The percentage shares of sensitive examples in our training dataset from synthetic generation using templates and from manual annotations are 98.04% and 1.96%, respectively. The percentage shares of non-sensitive examples from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations are 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also have a lower memory footprint (22.5% lower than the baseline and 80% lower than Logistic Regression) while giving improved accuracy.
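The recall drawback of keyword-based rules described in the abstract can be illustrated with a minimal sketch. The keyword list and utterances below are hypothetical stand-ins (the paper's actual rule set is not published); the point is that an exact-match rule fires on a listed keyword but misses a minor morphological variation of it.

```python
import re

# Hypothetical keyword list standing in for a rule-based baseline.
SENSITIVE_KEYWORDS = ["insult", "profanity", "slur"]

# Word-boundary regex: matches the listed keywords exactly, nothing more.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, SENSITIVE_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

def is_sensitive(utterance: str) -> bool:
    """Flag an utterance if any keyword appears as a whole word."""
    return pattern.search(utterance) is not None

# The rule fires on an exact keyword occurrence...
assert is_sensitive("that was an insult")
# ...but misses a minor variation ("insulting"), illustrating the
# coverage/generalization drawback that motivates the ML approach.
assert not is_sensitive("stop insulting me")
```

This is exactly the failure mode an ML model trained on varied examples can overcome: it can score unseen variants rather than requiring every surface form to be enumerated.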
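Template-based synthetic generation, which the abstract credits with producing the bulk of the sensitive training examples, can be sketched as a cross product of slot fillers over utterance templates. The templates and fillers below are invented for illustration; the paper's actual templates are not published.

```python
import string
from itertools import product

# Hypothetical templates with named slots, and fillers for each slot.
TEMPLATES = ["tell me a {adj} joke", "say something {adj} about {target}"]
SLOTS = {
    "adj": ["offensive", "rude"],
    "target": ["politicians", "my neighbor"],
}

def expand(template):
    """Yield every filled-in variant of a template."""
    # Discover which slot names the template actually uses.
    names = [f for _, f, _, _ in string.Formatter().parse(template) if f]
    # Cross product of the fillers for those slots.
    for combo in product(*(SLOTS[n] for n in names)):
        yield template.format(**dict(zip(names, combo)))

# Every generated utterance inherits the template's label (here: sensitive).
examples = [u for t in TEMPLATES for u in expand(t)]
# First template: 2 adj fillers; second: 2 adj x 2 target = 4; total 6.
assert len(examples) == 6
```

A small number of templates multiplied by modest filler lists yields labels at scale with almost no manual annotation, which is consistent with the abstract's report that 98.04% of sensitive training examples came from template-based generation.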