
Latest publications: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

FARE: Schema-Agnostic Anomaly Detection in Social Event Logs
Neil Shah
Online social platforms are constantly under attack by bad actors. These bad actors often leverage resources (e.g. IPs, devices) under their control to attack the platform by targeting various vulnerable endpoints (e.g. account authentication, sybil account creation, friending) which may process millions to billions of events every day. As the scale and multifacetedness of malicious behaviors grow, and new endpoints and corresponding events are utilized and processed every day, the development of fast, extensible and schema-agnostic anomaly detection approaches to enable standardized protocols for different classes of events is critical. This is a notable challenge given that practitioners often have neither the time nor the means to custom-build anomaly detection services for each new event class and type. Moreover, labeled data is rarely available in such diverse settings, making unsupervised methods appealing. In this work, we study unsupervised, schema-agnostic characterization and detection of resource usage anomalies in social event logs. We propose an efficient algorithmic approach to this end, and evaluate it with promising results on several log datasets of different event classes. Specifically, our contributions include a) formulation: a novel articulation of the schema-agnostic anomaly detection problem for event logs, b) approach: we propose FARE (Finding Anomalous Resources and Events), which integrates online resource anomaly detection and offline event culpability identification components, and c) efficacy: demonstrated accuracy (100% precision@250 on two industrial datasets from the Snapchat platform, with ~50% of anomalies previously uncaught by state-of-the-art production defenses), robustness (high precision/recall over suitable synthetic attacks and parameter choices) and scalability (near-linear in the number of events).
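The precision@250 figure reported above is a standard top-k ranking metric. A minimal sketch of how such a number is computed (scores and labels below are invented for illustration, not the paper's data):

```python
def precision_at_k(scores, labels, k):
    """Fraction of the top-k highest-scoring items that are true anomalies."""
    # Rank items by descending anomaly score and inspect the top k.
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(labels[i] for i in top_k) / k

# Toy example: 3 of the 4 highest-scored items are labeled anomalous.
scores = [0.9, 0.8, 0.1, 0.7, 0.95, 0.2]
labels = [1,   1,   0,   0,   1,    0]
print(precision_at_k(scores, labels, 4))  # → 0.75
```

The paper's 100% precision@250 means all 250 top-ranked resources were confirmed anomalous on manual review.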
DOI: 10.1109/DSAA.2019.00049 · Published: 2019-10-01
Citations: 0
Unsupervised Drift Detector Ensembles for Data Stream Mining
Lukasz Korycki, B. Krawczyk
Data stream mining is among the most contemporary branches of machine learning. The potentially infinite sources give us many opportunities and at the same time pose new challenges. To properly handle streaming data we need to improve our well-established methods, so they can work with dynamic data and under strict constraints. Supervised streaming machine learning algorithms require a certain number of labeled instances in order to stay up-to-date. Since high budgets dedicated for this purpose are usually infeasible, we have to limit the supervision as much as we can. One possible approach is to trigger labeling only when a change is explicitly indicated by a detector. While there are several supervised algorithms dedicated for this purpose, the more practical unsupervised ones still lack proper attention. In this paper, we propose a novel unsupervised ensemble drift detector that recognizes local changes in feature subspaces (EDFS) without additional supervision, using specialized committees of incremental Kolmogorov-Smirnov tests. We combine it with an adaptive classifier and update the classifier only when the drift detector signals a change. Conducted experiments show that our framework is able to efficiently adapt to various concept drifts and outperform other unsupervised algorithms.
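The core test behind such a detector can be sketched with SciPy's plain two-sample Kolmogorov-Smirnov test applied per feature. This is a simplified illustration only: the paper's EDFS uses *incremental* KS tests organized into committees over feature subspaces, which this sketch does not reproduce, and the window sizes and threshold below are invented:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, current, alpha=0.01):
    """Return indices of features whose distribution differs between
    the reference window and the current window (two-sample KS test)."""
    drifted = []
    for j in range(reference.shape[1]):
        stat, p = ks_2samp(reference[:, j], current[:, j])
        if p < alpha:
            drifted.append(j)
    return drifted

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(500, 3))
cur = ref.copy()
cur[:, 1] += 3.0  # inject a mean shift into feature 1 only
print(detect_drift(ref, cur))  # → [1]
```

In a streaming setting the reference window would be refreshed after each confirmed drift, and labeling would be requested only for the flagged windows.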
DOI: 10.1109/DSAA.2019.00047 · Published: 2019-10-01
Citations: 12
Matrix Profile XVI: Efficient and Effective Labeling of Massive Time Series Archives
Frank Madrid, Shailendra Singh, Q. Chesnais, K. Mauck, Eamonn J. Keogh
In domains as diverse as entomology and sports medicine, analysts are routinely required to label large amounts of time series data. In a few rare cases, this can be done automatically with a classification algorithm. In many domains, however, complex, noisy, and polymorphic data can defeat state-of-the-art classifiers, yet easily yield to human inspection and annotation. This is especially true if the human can access auxiliary information and previous annotations. This labeling task can be a significant bottleneck in scientific progress. For example, an entomology or sports physiology lab may produce several days' worth of time series each day. In this work, we introduce an algorithm that greatly reduces the human effort required. Our interactive algorithm groups subsequences and invites the user to label a group's prototype, brushing the label to all members of the group. Thus, our task reduces to optimizing the grouping(s), to allow our system to ask the fewest questions of the user. As we shall show, on diverse domains, we can reduce the human effort by at least an order of magnitude, with no decrease in accuracy.
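The label-brushing step can be sketched as nearest-prototype assignment: the user labels one prototype per group and the label propagates to every member. This is a minimal illustration with invented data and plain Euclidean distance; it does not reproduce the paper's grouping optimization or matrix-profile machinery:

```python
import numpy as np

def brush_labels(subsequences, prototypes, prototype_labels):
    """Assign each subsequence the label of its nearest prototype."""
    labels = []
    for s in subsequences:
        dists = [np.linalg.norm(s - p) for p in prototypes]
        labels.append(prototype_labels[int(np.argmin(dists))])
    return labels

# Two hypothetical prototypes the user has labeled, and three unlabeled
# subsequences to brush.
protos = [np.array([0.0, 0.0, 0.0]), np.array([5.0, 5.0, 5.0])]
subs = [np.array([0.1, -0.2, 0.0]),
        np.array([4.8, 5.1, 5.0]),
        np.array([0.3, 0.1, 0.2])]
print(brush_labels(subs, protos, ["walking", "running"]))
# → ['walking', 'running', 'walking']
```

The human answers one question per group; the cost of labeling then scales with the number of groups, not the number of subsequences.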
DOI: 10.1109/DSAA.2019.00061 · Published: 2019-10-01
Citations: 2
Higher Order Mining for Monitoring District Heating Substations
Shahrooz Abghari, V. Boeva, Jens P. Brage, C. Johansson, Håkan Grahn, Niklas Lavesson
We propose a higher order mining (HOM) approach for modelling, monitoring and analyzing district heating (DH) substations' operational behaviour and performance. HOM is concerned with mining over patterns rather than primary or raw data. The proposed approach uses a combination of different data analysis techniques such as sequential pattern mining, clustering analysis, consensus clustering and minimum spanning tree (MST). Initially, a substation's operational behaviour is modeled by extracting weekly patterns and performing clustering analysis. The substation's performance is monitored by assessing its modeled behaviour for every two consecutive weeks. In case some significant difference is observed, further analysis is performed by integrating the built models into a consensus clustering and applying an MST for identifying deviating behaviours. The results of the study show that our method is robust for detecting deviating and sub-optimal behaviours of DH substations. In addition, the proposed method can facilitate domain experts in the interpretation and understanding of the substations' behaviour and performance by providing different data analysis and visualization techniques.
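The MST step for flagging deviating behaviours can be sketched with SciPy, assuming the weekly-behaviour cluster centroids have already been computed; a centroid joined to the rest of the tree by an unusually long edge is a candidate deviation. The centroid values below are invented for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Hypothetical 2-D cluster centroids; the last one sits far from the rest.
centroids = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0]])

dist = squareform(pdist(centroids))          # pairwise Euclidean distances
mst = minimum_spanning_tree(dist).toarray()  # edge weights of the MST
longest_edge = mst.max()
outlier_pair = np.unravel_index(mst.argmax(), mst.shape)
print(outlier_pair)  # the long edge isolates centroid index 3
```

In the paper's pipeline this check runs only when two consecutive weeks' models differ significantly, after the models are merged via consensus clustering.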
DOI: 10.1109/DSAA.2019.00053 · Published: 2019-10-01
Citations: 8
Breast Cancer Classification using Deep Transfer Learning on Structured Healthcare Data
A. Farhadi, David Chen, R. McCoy, Christopher G. Scott, J. Miller, C. Vachon, Che Ngufor
Efforts to improve early identification of aggressive high grade breast cancers, which pose the greatest risk to patient health if not detected early, are hindered by the rarity of these events. To address this problem, we proposed an accurate and efficient deep transfer learning method to handle the imbalanced data problem that is prominent in breast cancer data. In contrast to existing approaches based primarily on large image databases, we focused on structured data, which has not been commonly used for deep transfer learning. We used a number of publicly available breast cancer data sets to generate a "pre-trained" model and transfer learned concepts to predict high grade malignant tumors in patients diagnosed with breast cancer at Mayo Clinic. We compared our results with state-of-the-art techniques for addressing the problem of imbalanced learning and confirmed the superiority of the proposed method. To further demonstrate the ability of the proposed method to handle different degrees of class imbalance, a series of experiments were performed on publicly available breast cancer data under simulated class imbalanced settings. Based on the experimental results, we concluded that the proposed deep transfer learning on structured data can be used as an efficient method to handle imbalanced class problems in clinical research.
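The pretrain-then-fine-tune idea on tabular data can be sketched with scikit-learn's `MLPClassifier` and `warm_start=True`, which makes a second `fit()` call resume from the weights learned on the source data. The synthetic arrays, architecture, and hyperparameters below are placeholders standing in for the public cohorts and the Mayo Clinic cohort, not the paper's setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
# Large "public" source cohort and small "target" cohort sharing a concept.
X_source = rng.normal(size=(400, 10))
y_source = (X_source[:, 0] + X_source[:, 1] > 0).astype(int)
X_target = rng.normal(size=(60, 10))   # small, as in rare-event settings
y_target = (X_target[:, 0] + X_target[:, 1] > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(16,), warm_start=True,
                    max_iter=300, random_state=0)
clf.fit(X_source, y_source)  # "pre-train" on the large source cohort
clf.fit(X_target, y_target)  # fine-tune on the small target cohort
print(round(clf.score(X_target, y_target), 2))
```

The point of the transfer is that the fine-tuning step starts from informative weights, which matters most when the target class is rare and small.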
DOI: 10.1109/DSAA.2019.00043 · Published: 2019-10-01
Citations: 9
Machine Learning for Efficient Integration of Record Systems for Missing US Service Members
Julia D. Warnke-Sommer, Franklin E. Damann
More than 16 million Americans served in World War II. Of these service members, over 400,000 were killed in action during the war. Today, more than 72,000 service members remain unaccounted for from World War II. The United States continues to diligently locate, recover, and identify missing personnel from World War II and other past conflicts to provide the fullest possible accounting. This work importantly provides closure and resolution to numerous US families. To fulfill this mission, massive amounts of information must be integrated from historical records, genealogy records, anthropological data, archeological data, odontology data, and DNA. These disparate data sources are produced and maintained by multiple agencies, with different data governance rules and different internal structuring of service member information. Previously, a manual approach had been undertaken to Extract, Transform, Load (ETL) records from these different data sources, which creates the potential for introduced human error. In addition, a large number of person-hours were required to synthesize this data on a biweekly basis. To address this issue, we implemented (i) a regex decision tree to translate genealogical relationships into DNA type availability and (ii) a machine learning approach for record-linkage between disparate data sources. This application is currently in production and greatly reduces person-hours needed and has a very low error rate for record translation and integration.
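Step (i) can be sketched as a small rule table mapping free-text relationship descriptions to DNA reference types: mitochondrial DNA (mtDNA) follows the maternal line and Y-STR the paternal line. The authors' actual regex decision tree is not reproduced here; the patterns and the autosomal-STR fallback below are invented examples:

```python
import re

# Illustrative rules only. A brother shares both the maternal line (mtDNA)
# and the paternal line (Y-STR) with the missing service member.
RULES = [
    ("mtDNA", re.compile(r"\b(maternal|mother|sister|brother)\b")),
    ("Y-STR", re.compile(r"\b(paternal|father|brother)\b")),
]

def dna_types(relationship):
    """DNA reference type(s) a relative of this description could donate."""
    rel = relationship.lower()
    matched = [dna for dna, pattern in RULES if pattern.search(rel)]
    return matched or ["auSTR"]  # fall back to autosomal STR for other kin

print(dna_types("maternal aunt"))  # → ['mtDNA']
print(dna_types("brother"))        # → ['mtDNA', 'Y-STR']
print(dna_types("first cousin"))   # → ['auSTR']
```

Encoding the genealogy-to-DNA mapping as data-driven rules is what lets the pipeline run unattended instead of relying on the manual biweekly ETL.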
DOI: 10.1109/DSAA.2019.00071 · Published: 2019-10-01
Citations: 4
An Incremental Technique for Mining Coverage Patterns in Large Databases
Akhil Ralla, P. Reddy, Anirban Mondal
Pattern mining is an important task of data mining and involves the extraction of interesting associations from large databases. Typically, pattern mining is carried out from huge databases, which tend to get updated several times. Consequently, as a given database is updated, some of the patterns discovered may become invalid, while some new patterns may emerge. This has motivated significant research efforts in the area of Incremental Mining. The goal of incremental mining is to efficiently and incrementally mine patterns when a database is updated as opposed to mining all of the patterns from scratch from the complete database. Incidentally, research efforts are being made to develop incremental pattern mining algorithms for extracting different kinds of patterns such as frequent patterns, sequential patterns and utility patterns. However, none of the existing works addresses incremental mining in the context of coverage patterns, which has important applications in areas such as banner advertising, search engine advertising and graph mining. In this regard, the main contributions of this work are three-fold. First, we introduce the problem of incremental mining in the context of coverage patterns. Second, we propose the IncCMine algorithm for efficiently extracting the knowledge of coverage patterns when incremental database is added to the existing database. Third, we performed extensive experiments using two real-world click stream datasets and one synthetic dataset. The results of our performance evaluation demonstrate that our proposed IncCMine algorithm indeed improves the performance significantly w.r.t. the existing CMine algorithm.
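The two measures that define coverage patterns in the CMine line of work, coverage support (fraction of transactions covered by the union of a pattern's items) and overlap ratio (how redundantly a new item overlaps what is already covered), can be sketched over a toy transaction database. The data and thresholds here are invented for illustration:

```python
# Toy transaction database: each set is one transaction.
db = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}, {"a", "c"}]

def tids(item):
    """Ids of transactions containing `item`."""
    return {i for i, t in enumerate(db) if item in t}

def coverage_support(pattern):
    """Fraction of transactions covered by the union of the pattern's items."""
    covered = set().union(*(tids(x) for x in pattern))
    return len(covered) / len(db)

def overlap_ratio(pattern, item):
    """Share of `item`'s transactions already covered by `pattern`."""
    covered = set().union(*(tids(x) for x in pattern))
    return len(tids(item) & covered) / len(tids(item))

print(coverage_support({"a", "b"}))  # → 0.8 (covers transactions 0, 1, 2, 4)
print(overlap_ratio({"a"}, "b"))     # → 0.5 (one of b's two transactions)
```

A coverage pattern keeps coverage support above a minimum and overlap ratio below a maximum; the incremental problem is maintaining this set as new transactions arrive without rescanning the full database.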
DOI: 10.1109/DSAA.2019.00036 · Published: 2019-10-01
Citations: 5
Chest Tube Management After Lung Resection Surgery using a Classifier
W. Klement, S. Gilbert, D. Maziak, A. Seely, F. Shamji, S. Sundaresan, P. Villeneuve, N. Japkowicz
After lung surgery, a chest tube and a pump are used to manage air leaks and fluid drainage from the chest. The decision to remove or maintain the chest tube is based on drainage data collected from a digital pump that continuously monitors the patient. We construct a classifier to support this clinical decision-making process by identifying, early on, patients who may suffer adverse, extended air leaks. Intuitively, this problem can be modelled as a time series fitted to monitoring data. However, we present a solution using a simple classifier constructed from data collected in a specific time frame (36-48 hours) after surgery. We hypothesize that after surgery, patients struggle to attain a stable (favourable or adverse) status, which prevails after a period of discrepancies and inconsistencies in the data. The solution we propose is to identify the time frame in which the majority of patients achieve their states of stability. Advantages of this approach include better classification performance with a lower burden of data collection during patient treatment. The paper presents chest tube management as a classification task performed in a sliding window over time during patient monitoring.
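The window-based idea can be sketched end-to-end on simulated pump data: summarize only the 36-48 hour window with simple features and feed them to a classifier. All values, units, and the generative model below are invented placeholders, not the paper's cohort or features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

def window_features(airflow):
    """Mean and trend of hourly air-leak readings in the 36-48 h window."""
    w = airflow[36:48]
    slope = np.polyfit(np.arange(len(w)), w, 1)[0]
    return [w.mean(), slope]

# Simulate 100 patients: 'stable' leaks decay toward zero, 'adverse' persist.
X, y = [], []
for _ in range(100):
    adverse = rng.random() < 0.3
    base = 40.0 if adverse else 40.0 * np.exp(-np.arange(72) / 10.0)
    airflow = np.maximum(base + rng.normal(0, 2.0, size=72), 0.0)
    X.append(window_features(airflow))
    y.append(int(adverse))

clf = LogisticRegression().fit(X, y)
print(round(clf.score(X, y), 2))  # training accuracy on the simulated cohort
```

Restricting features to one window keeps data collection cheap while still separating persistent from resolving leaks in this toy setting.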
DOI: 10.1109/DSAA.2019.00058 (published 2019-10-01)
Citations: 2
Data Reduction for real-time bridge vibration data on Edge
Anthony Chen, Fu-Hsuan Liu, Sheng-De Wang
In the Internet of Things (IoT) era, with the growing number of data sources, we need to face some challenges such as high cost of the cloud storage caused by large amounts of data. To minimize the communication time and enhance the performance, sending the entire large amount of data is not practical. Thus, it is appropriate to make use of edge computing, or data preprocessing on IoT gateways. In this paper, we propose a data reduction algorithm for the gateway of bridge vibration G-sensors. The data reduction algorithm is based on a pattern system, which is comprised of a pattern library and a pattern classifier. The pattern library is generated by using the K-means clustering method. The results show that the proposed approach is effective in data reduction and outlier detection for bridge vibration data collection on the IoT gateway.
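The pattern-library idea described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: vibration windows are clustered with K-means to form a small pattern library, and each new window is then reduced to a pattern id plus an outlier flag instead of raw samples. The window size, number of patterns, and distance threshold are assumed values.

```python
import numpy as np

def kmeans(windows, k, iters=20):
    """Tiny K-means; initializes centers by striding through the data."""
    centers = windows[:: len(windows) // k][:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(windows[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = windows[labels == j].mean(axis=0)
    return centers

def encode(window, centers, outlier_dist=1.0):
    """Reduce a raw window to (pattern_id, is_outlier)."""
    d = np.linalg.norm(centers - window, axis=1)
    pid = int(np.argmin(d))
    return pid, bool(d[pid] > outlier_dist)

# Build a toy library from two synthetic vibration regimes (8-sample windows).
calm = np.random.default_rng(1).normal(0.0, 0.05, (50, 8))
busy = np.random.default_rng(2).normal(1.0, 0.05, (50, 8))
lib = kmeans(np.vstack([calm, busy]), k=2)

print(encode(np.zeros(8), lib)[1])   # calm-like window -> False (not outlier)
print(encode(np.full(8, 5.0), lib)[1])  # unseen spike -> True (outlier)
```

Only the id and flag would cross the network; a match costs a few bytes where the raw window cost hundreds, which is the data-reduction payoff the abstract claims.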
DOI: 10.1109/DSAA.2019.00077 (published 2019-10-01)
Citations: 6
Hierarchical LSTM Framework for Long-Term Sea Surface Temperature Forecasting
Xi Liu, T. Wilson, P. Tan, L. Luo
Multi-step prediction of sea surface temperature (SST) is a challenging problem because small errors in its short-range forecasts can be compounded to create large errors at longer ranges. In this paper, we propose a hierarchical LSTM framework to improve the accuracy of long-term SST prediction. Our framework alleviates the error accumulation problem in multi-step prediction by leveraging outputs from an ensemble of physically-based dynamical models. Unlike previous methods, which simply take a linear combination of the outputs to produce a single deterministic forecast, our framework learns a nonlinear relationship among the ensemble member forecasts. In addition, its multi-level structure is designed to capture the temporal autocorrelation between forecasts generated for the same lead time as well as those generated for different lead times. Experiments performed using SST data from the tropical Pacific ocean region show that the proposed framework outperforms various baseline methods in more than 70% of the grid cells located in the study region.
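The data flow the abstract describes — an LSTM consuming the ensemble members' forecasts lead time by lead time, so the combination is nonlinear and carries state across lead times — can be sketched as below. This is an assumption-laden illustration with random, untrained weights: the member count, hidden size, residual-around-the-mean readout, and single-cell wiring are stand-ins, and the paper's multi-level architecture is not reproduced.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell; weights are random here, learned in practice."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked matrix for the input, forget, cell, and output gates.
        self.W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

n_members, n_hidden, n_leads = 5, 8, 12
cell = LSTMCell(n_members, n_hidden)
w_out = np.zeros(n_hidden)  # readout weights; would be learned jointly

# Synthetic ensemble: 5 dynamical-model forecasts (deg C) for 12 lead times.
ensemble = np.random.default_rng(3).normal(26.0, 0.5, (n_leads, n_members))
h = c = np.zeros(n_hidden)
forecast = []
for t in range(n_leads):
    h, c = cell.step(ensemble[t], h, c)
    # Nonlinear correction around the ensemble mean keeps units in deg C.
    forecast.append(ensemble[t].mean() + w_out @ h)

print(len(forecast))  # 12: one combined SST value per lead time
```

Because the cell's state persists across the loop, the correction at lead time t can depend on what the ensemble did at earlier lead times — the temporal autocorrelation the abstract says a plain linear combination cannot capture.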
DOI: 10.1109/DSAA.2019.00018 (published 2019-10-01)
Citations: 6
Journal: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)