Detecting Sensitive Content in Spoken Language

Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi
{"title":"Detecting Sensitive Content in Spoken Language","authors":"Rahul Tripathi, Balaji Dhamodharaswamy, S. Jagannathan, Abhishek Nandi","doi":"10.1109/DSAA.2019.00052","DOIUrl":null,"url":null,"abstract":"Spoken language can include sensitive topics including profanity, insults, political and offensive speech. In order to engage in contextually appropriate conversations, it is essential for voice services such as Alexa, Google Assistant, Siri, etc. to detect sensitive topics in the conversations and react appropriately. A simple approach to detect sensitive topics is to use regular expression or keyword based rules. However, keyword based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale and generalize well even for minor variations of the keywords. Machine learning (ML) approaches provide the potential benefit of generalization, but require large volumes of training data, which is difficult to obtain for sparse data problems. This paper describes: (1) a ML based solution that uses training data (2.1M dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>>90%). Moreover, in spite of relying on synthetic training data, the ML models are able to generalize beyond the training data to identify significantly higher amounts (~2x for Logistic Regression, and ~4x-6x for a Neural Network models such as Bi-LSTM and CNN) of the test stream as sensitive in comparison to a baseline approach using the training data (~ 1 Million examples) as rules. We are able to train our models with very few manual annotations. 
The percentage share of sensitive examples in our training dataset from synthetic generation using templates and manual annotations are 98.04% and 1.96%, respectively. The percentage share of non-sensitive examples in our training dataset from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations are 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also use lower memory footprint (22.5% lower than baseline and 80% lower than Logistic Regression) while giving improved accuracy.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2019.00052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Spoken language can include sensitive topics, including profanity, insults, and political or offensive speech. To engage in contextually appropriate conversations, voice services such as Alexa, Google Assistant, and Siri must detect sensitive topics in conversations and react appropriately. A simple approach to detecting sensitive topics is to use regular-expression or keyword-based rules. However, keyword-based rules have several drawbacks: (1) coverage (recall) depends on the exhaustiveness of the keywords, and (2) rules do not scale and do not generalize well even to minor variations of the keywords. Machine learning (ML) approaches offer the potential benefit of generalization but require large volumes of training data, which is difficult to obtain for sparse-data problems. This paper describes: (1) an ML-based solution that uses training data (a 2.1M dataset), obtained from synthetic generation and semi-supervised learning techniques, to detect sensitive content in spoken language; and (2) the results of evaluating its performance on several million test instances of live utterances. The results show that our ML models have very high precision (>90%). Moreover, despite relying on synthetic training data, the ML models generalize beyond the training data, identifying significantly more of the test stream as sensitive (~2x for Logistic Regression, and ~4x-6x for neural network models such as Bi-LSTM and CNN) than a baseline approach that uses the training data (~1 million examples) as rules. We are able to train our models with very few manual annotations. The percentage shares of sensitive examples in our training dataset from synthetic generation using templates and from manual annotations are 98.04% and 1.96%, respectively. The percentage shares of non-sensitive examples in our training dataset from synthetic generation using templates, automated labeling via semi-supervised techniques, and manual annotations are 15.35%, 83.75%, and 0.90%, respectively. The neural network models (Bi-LSTM and CNN) also have a lower memory footprint (22.5% lower than the baseline and 80% lower than Logistic Regression) while giving improved accuracy.
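The abstract does not publish the paper's actual templates or phrase lists. As a rough illustration of what template-based synthetic generation can look like (every template string and slot filler below is invented for this example), one might expand a small set of templates over the cross product of their slot fillers to produce labeled training utterances:

```python
import itertools

# Hypothetical templates and slot fillers; the paper's real
# templates and vocabularies are not given in the abstract.
TEMPLATES = [
    "tell me an {adj} joke about {topic}",
    "what do you think about {topic}",
]
SLOTS = {
    "adj": ["offensive", "insulting"],
    "topic": ["politics", "religion"],
}

def generate(templates, slots):
    """Expand each template over the cross product of the slot fillers it uses."""
    examples = []
    for t in templates:
        names = [n for n in slots if "{" + n + "}" in t]
        for values in itertools.product(*(slots[n] for n in names)):
            examples.append(t.format(**dict(zip(names, values))))
    return examples

sentences = generate(TEMPLATES, SLOTS)
print(len(sentences))  # 2*2 combinations + 2 = 6 synthetic examples
```

Even a handful of templates multiplied by modest phrase lists yields many labeled examples, which is consistent with the abstract's report that 98.04% of the sensitive training examples came from template-based synthetic generation.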