Large-Scale Entity Extraction from Enterprise Data

Rajeev Gupta, Ranganath Kondapally
Published in: Proceedings of the Second International Conference on AI-ML Systems
Publication date: 2022-10-12
DOI: 10.1145/3564121.3564818 (https://doi.org/10.1145/3564121.3564818)
Citations: 0

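Abstract

Enterprise adoption of cloud computing has exploded over the last decade, and most applications used by enterprise users have moved to the cloud. These applications include collaboration software (e.g., Word, Excel), instant messaging (e.g., Chat), and asynchronous communication (e.g., Email). This has resulted in an exponential increase in the volume of data arising from users' interactions with online applications (documents edited, people interacted with, meetings attended, etc.). A user's activities provide strong insights about that user: meetings attended indicate the set of people the user works closely with, and documents edited indicate the topics the user works on. Typically, this data is private and confidential to the enterprise, a part of the enterprise, or the individual employee. To provide a better experience and assist employees in their activities, it is critical to mine certain entities from this data.

In this tutorial, we explain various entities that can be extracted from enterprise data to help employees be more productive. Specifically, we define and extract enterprise entities such as tasks, commitments, calendar activity, acronyms, topics, and definitions. These entities are extracted using different techniques: tasks and commitments are extracted using intent mining techniques (e.g., sentiment extraction), definitions are extracted using sequence mining techniques, and calendars are updated using the user's flight and hotel booking entities.

Entity extraction from enterprise data poses an interesting and complex challenge from the point of view of scalable information extraction: privacy and access-control constraints leave little data to learn from, yet the models must be highly accurate and run on a large amount of diverse data from across the whole enterprise. Specifically, we need to overcome the following challenges:

Privacy: For legal and trust reasons, an individual user's data should be accessible only to the people for whom it is intended. Thus, we cannot directly apply the openly available techniques for mining these entities, all of which require labeled data.

Efficiency: Enterprises need to process billions of emails, chats, and other documents every day, different for different users, so extraction models need to be very efficient.

Scalability: Information is presented in enterprise documents in a large number of variations. For example, different providers represent a flight itinerary in different ways, and the definition of the same topic can be expressed differently in different documents. We should be able to extract entities irrespective of how they are presented.

Multi-lingual: Users are located across geographies, so information extraction needs to work across multiple languages.

Extracting these entities requires supervised data. How do we get labeled data in a privacy-preserving manner? How do we build models with the minimum amount of supervised data? We have a large amount of unsupervised (unlabeled) data, and we present techniques to learn from large unsupervised data together with small supervised data. Many techniques use user feedback (e.g., clicks) to refine information extraction models, but feedback is difficult to come by in enterprise settings. Can we use weak supervision? Can we take an off-the-shelf model (say, for definition classification) and refine it for enterprise settings? We will cover all of these techniques and how they improve precision and recall in enterprise settings.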
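
The abstract asks whether weak supervision can stand in for scarce labeled data. As a rough illustration only, and not the authors' method, the Python sketch below shows one common form of the idea: several noisy, hand-written labeling functions vote on whether a sentence looks like a definition, and a simple majority vote combines them into a weak label. Every function name, regular expression, and label constant here is a hypothetical choice made for this example; a real system would likely use many more rules and a learned combiner rather than a plain vote.

```python
import re
from collections import Counter

# Hypothetical label constants for this sketch.
ABSTAIN, NOT_DEF, DEF = -1, 0, 1


def lf_is_a(sentence: str) -> int:
    """Vote DEF when the sentence contains an 'is a/an/the' style pattern."""
    return DEF if re.search(r"\b(is|are) (a|an|the)\b", sentence, re.I) else ABSTAIN


def lf_cue_phrase(sentence: str) -> int:
    """Vote DEF on explicit definitional cue phrases."""
    pattern = r"\b(stands for|refers to|is defined as)\b"
    return DEF if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_question(sentence: str) -> int:
    """Vote NOT_DEF for questions, which rarely define a term."""
    return NOT_DEF if sentence.strip().endswith("?") else ABSTAIN


def lf_request(sentence: str) -> int:
    """Vote NOT_DEF for sentences that open like a request."""
    pattern = r"\s*(please|can you|could you)\b"
    return NOT_DEF if re.match(pattern, sentence, re.I) else ABSTAIN


LABELING_FUNCTIONS = [lf_is_a, lf_cue_phrase, lf_question, lf_request]


def weak_label(sentence: str) -> int:
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[DEF] == counts[NOT_DEF]:
        return ABSTAIN  # covers both "no votes" and "tied votes"
    return DEF if counts[DEF] > counts[NOT_DEF] else NOT_DEF


if __name__ == "__main__":
    samples = [
        "SLA stands for service level agreement.",
        "Can you send the report?",
        "What is a sprint?",
        "See you at the meeting tomorrow.",
    ]
    for s in samples:
        print(weak_label(s), s)
```

Sentences that receive a non-abstain weak label could then serve as noisy training data for a classifier, so no person has to read private content to produce labels. The trade-off is label noise: rules like these are cheap to write but imprecise, which is why they abstain whenever the votes conflict.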