Copy-Move Forgery Verification in Images Using Local Feature Extractors and Optimized Classifiers
Pub Date: 2023-04-07 | DOI: 10.26599/BDMA.2022.9020029
S. B. G. Tilak Babu; Ch Srinivasa Rao
Passive image forgery detection methods, which identify forgeries without prior knowledge, have become a key research focus. In copy-move forgery, an attacker hides a portion of an image by pasting over it another portion of the same image. Detecting such manipulations is in great demand in legal evidence, forensic investigation, and many other fields. This paper presents copy-move forgery detection algorithms built on advanced feature descriptors, such as the local ternary pattern, local phase quantization, the local Gabor binary pattern histogram sequence, the Weber local descriptor, and the local monotonic pattern, paired with classifiers such as an optimized support vector machine (SVM) and an optimized naive Bayes classifier (NBC). The proposed algorithms can efficiently classify an image as either copy-move forged or authentic, even when the test image has been subjected to attacks such as JPEG compression, scaling, rotation, and brightness variation. Images from the CoMoFoD, CASIA, and MICC datasets, as well as a combined CoMoFoD-CASIA set, are used to quantify the performance of the proposed algorithms, which remain more efficient than state-of-the-art algorithms even when the suspected image has been post-processed.
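A minimal sketch of this kind of descriptor-plus-classifier pipeline follows. It uses scikit-image's local binary pattern as a readily available relative of the descriptors named above (LTP, LPQ, and so on) and a grid-searched SVM as the optimized classifier; the random images and labels are hypothetical stand-ins for a dataset such as CoMoFoD, not the paper's actual method.

```python
# Sketch only: LBP histograms stand in for the paper's descriptors (LTP, LPQ, ...),
# and random arrays stand in for a real forgery dataset such as CoMoFoD.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def lbp_histogram(gray_image, points=8, radius=1):
    """Uniform-LBP histogram used as a global texture signature of one image."""
    codes = local_binary_pattern(gray_image, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns plus one catch-all bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

rng = np.random.default_rng(0)
images = [(rng.random((64, 64)) * 255).astype("uint8") for _ in range(40)]
labels = rng.integers(0, 2, size=40)  # 1 = copy-move forged, 0 = authentic

X = np.stack([lbp_histogram(img) for img in images])

# "Optimized SVM": a small grid search over kernel hyperparameters.
search = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 3))
```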
{"title":"Copy-Move Forgery Verification in Images Using Local Feature Extractors and Optimized Classifiers","authors":"S. B. G. Tilak Babu;Ch Srinivasa Rao","doi":"10.26599/BDMA.2022.9020029","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020029","url":null,"abstract":"Passive image forgery detection methods that identify forgeries without prior knowledge have become a key research focus. In copy-move forgery, the assailant intends to hide a portion of an image by pasting other portions of the same image. The detection of such manipulations in images has great demand in legal evidence, forensic investigation, and many other fields. The paper aims to present copy-move forgery detection algorithms with the help of advanced feature descriptors, such as local ternary pattern, local phase quantization, local Gabor binary pattern histogram sequence, Weber local descriptor, and local monotonic pattern, and classifiers such as optimized support vector machine and optimized NBC. The proposed algorithms can classify an image efficiently as either copy-move forged or authenticated, even if the test image is subjected to attacks such as JPEG compression, scaling, rotation, and brightness variation. CoMoFoD, CASIA, and MICC datasets and a combination of CoMoFoD and CASIA datasets images are used to quantify the performance of the proposed algorithms. The proposed algorithms are more efficient than state-of-the-art algorithms even though the suspected image is post-processed.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 3","pages":"347-360"},"PeriodicalIF":13.6,"publicationDate":"2023-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10097649/10097650.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67838277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Security and Privacy in Metaverse: A Comprehensive Survey
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020047
Yan Huang; Yi Joy Li; Zhipeng Cai
The Metaverse describes a new shape of cyberspace and has become a trending term since 2021. There are many explanations of what the Metaverse is and many attempts to provide a formal standard or definition, but these definitions have hardly reached universal acceptance. Rather than offering another formal definition, we list four must-have characteristics of the Metaverse: socialization, immersive interaction, real world-building, and expandability. These characteristics not only carve the Metaverse into a novel and fantastic digital world, but also expose it to a wide range of security and privacy risks, such as personal information leakage, eavesdropping, unauthorized access, phishing, data injection, broken authentication, insecure design, and more. This paper first introduces the four characteristics; it then surveys the current progress and typical applications of the Metaverse, categorizing them into four economic sectors. Based on the four characteristics and the findings on current progress, the security and privacy issues in the Metaverse are investigated. We then identify and discuss further critical security and privacy issues that can arise from combining the four characteristics. Lastly, the paper raises some broader concerns regarding society and humanity.
{"title":"Security and Privacy in Metaverse: A Comprehensive Survey","authors":"Yan Huang;Yi Joy Li;Zhipeng Cai","doi":"10.26599/BDMA.2022.9020047","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020047","url":null,"abstract":"Metaverse describes a new shape of cyberspace and has become a hot-trending word since 2021. There are many explanations about what Meterverse is and attempts to provide a formal standard or definition of Metaverse. However, these definitions could hardly reach universal acceptance. Rather than providing a formal definition of the Metaverse, we list four must-have characteristics of the Metaverse: socialization, immersive interaction, real world-building, and expandability. These characteristics not only carve the Metaverse into a novel and fantastic digital world, but also make it suffer from all security/privacy risks, such as personal information leakage, eavesdropping, unauthorized access, phishing, data injection, broken authentication, insecure design, and more. This paper first introduces the four characteristics, then the current progress and typical applications of the Metaverse are surveyed and categorized into four economic sectors. Based on the four characteristics and the findings of the current progress, the security and privacy issues in the Metaverse are investigated. We then identify and discuss more potential critical security and privacy issues that can be caused by combining the four characteristics. Lastly, the paper also raises some other concerns regarding society and humanity.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"234-247"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026513.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Distributed Computing Frameworks for Supporting Big Data Analysis
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020014
Xudong Sun; Yulin He; Dingming Wu; Joshua Zhexue Huang
Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support efficient processing of big data on clusters or the cloud. However, the size of big data grows at a pace faster than the growth in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are inadequate for big data analysis tasks, which often require running complex analytical algorithms on extremely large data sets at the terabyte scale. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limits, and a restricted set of analytical algorithms, because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to overcome these challenges. In this paper, we review MapReduce-type distributed computing frameworks currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome these challenges.
{"title":"Survey of Distributed Computing Frameworks for Supporting Big Data Analysis","authors":"Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang","doi":"10.26599/BDMA.2022.9020014","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020014","url":null,"abstract":"Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"154-169"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026506.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud-Based Software Development Lifecycle: A Simplified Algorithm for Cloud Service Provider Evaluation with Metric Analysis
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020016
Santhosh S; Narayana Swamy Ramaiah
At present, hundreds of cloud vendors in the global market provide various services based on customers' requirements. Cloud vendors are not all alike in the number of services offered, infrastructure availability, security strategies, cost per customer, and market reputation. Software developers and organizations therefore face a dilemma when choosing a suitable cloud vendor for their development activities, so various cloud service providers (CSPs) and platforms must be evaluated before a vendor is chosen. Existing solutions either rely on simulation tools configured to the requirements or evaluate vendors against quality-of-service attributes; both require considerable time to collect data, simulate, and evaluate each vendor. The proposed work compares various CSPs on major metrics, such as establishment, services, infrastructure, tools, pricing models, and market share, with a ranking and weightage allocated to each parameter. The parameters are further categorized by priority level. A weighted average is calculated for each CSP, and the resulting values are sorted in descending order. The experimental results show an unbiased selection of CSPs based on the chosen parameters. The proposed parameter-ranking priority level weightage (PRPLW) algorithm simplifies the selection of the cloud vendor best suited to the requirements of software development.
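To make the weighted-average step concrete, here is a minimal sketch, assuming hypothetical vendors, parameter scores, and priority weights; the paper's actual parameters and weightage values are not reproduced here.

```python
# Sketch only: the vendor names, scores, and weights are invented placeholders
# illustrating the rank-weight-sort pattern of a PRPLW-style evaluation.
weights = {"services": 0.30, "infrastructure": 0.25, "pricing": 0.25, "market_share": 0.20}

csp_scores = {
    "VendorA": {"services": 9, "infrastructure": 8, "pricing": 6, "market_share": 9},
    "VendorB": {"services": 7, "infrastructure": 9, "pricing": 8, "market_share": 6},
    "VendorC": {"services": 8, "infrastructure": 7, "pricing": 9, "market_share": 7},
}

def weighted_average(scores, weights):
    """Priority-weighted mean of one CSP's parameter scores (higher is better)."""
    return sum(weights[p] * s for p, s in scores.items()) / sum(weights.values())

# Score every vendor, then sort in descending order of weighted average.
ranking = sorted(
    ((name, weighted_average(s, weights)) for name, s in csp_scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```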
{"title":"Cloud-Based Software Development Lifecycle: A Simplified Algorithm for Cloud Service Provider Evaluation with Metric Analysis","authors":"Santhosh S;Narayana Swamy Ramaiah","doi":"10.26599/BDMA.2022.9020016","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020016","url":null,"abstract":"At present, hundreds of cloud vendors in the global market provide various services based on a customer's requirements. All cloud vendors are not the same in terms of the number of services, infrastructure availability, security strategies, cost per customer, and reputation in the market. Thus, software developers and organizations face a dilemma when choosing a suitable cloud vendor for their developmental activities. Thus, there is a need to evaluate various cloud service providers (CSPs) and platforms before choosing a suitable vendor. Already existing solutions are either based on simulation tools as per the requirements or evaluated concerning the quality of service attributes. However, they require more time to collect data, simulate and evaluate the vendor. The proposed work compares various CSPs in terms of major metrics, such as establishment, services, infrastructure, tools, pricing models, market share, etc., based on the comparison, parameter ranking, and weightage allocated. Furthermore, the parameters are categorized depending on the priority level. The weighted average is calculated for each CSP, after which the values are sorted in descending order. The experimental results show the unbiased selection of CSPs based on the chosen parameters. The proposed parameter-ranking priority level weightage (PRPLW) algorithm simplifies the selection of the best-suited cloud vendor in accordance with the requirements of software development.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"127-138"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026515.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EScope: Effective Event Validation for IoT Systems Based on State Correlation
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020034
Jian Mao; Xiaohe Xu; Qixiao Lin; Liran Ma; Jianwei Liu
Typical Internet of Things (IoT) systems are event-driven platforms in which smart sensing devices sense or subscribe to events (device state changes) and react according to preconfigured trigger-action logic, known as automation rules. Events are essential elements for automatic control in an IoT system. However, events are not always trustworthy: fake event notifications injected by attackers (known as event spoofing attacks) can trigger sensitive actions through automation rules without involving authorized users. Existing solutions verify events via “event fingerprints” extracted from surrounding sensors. However, if a system has homogeneous sensors with strong correlations among them, traditional threshold-based methods may cause information redundancy and noise amplification, consequently decreasing checking accuracy. To address this, we propose EScope, an effective event validation approach that checks the authenticity of system events based on device state correlation. EScope selects informative and representative sensors using a neural-network-based sensor selection component and extracts a verification sensor set for event validation. We evaluate our approach using an existing dataset provided by Peeves. The experimental results demonstrate that EScope achieves an average 67% reduction in the number of sensors across 22 events compared with existing work, while increasing event spoofing detection accuracy.
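The selection step can be illustrated with a deliberately simple stand-in: ranking candidate sensors by mutual information with the event label instead of the paper's neural-network-based component. The synthetic readings below are placeholders for data such as the Peeves dataset.

```python
# Sketch only: mutual information replaces EScope's NN-based selection, and the
# synthetic readings/labels replace real event data such as the Peeves dataset.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # 500 time windows x 12 candidate sensors
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)  # toy event driven by sensors 3 and 7

# Rank sensors by how informative they are about whether the event occurred,
# then keep a small verification set for event validation.
scores = mutual_info_classif(X, y, random_state=0)
top_k = 4
verification_set = np.argsort(scores)[::-1][:top_k]
print("selected sensors:", sorted(verification_set.tolist()))
```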
{"title":"EScope: Effective Event Validation for IoT Systems Based on State Correlation","authors":"Jian Mao;Xiaohe Xu;Qixiao Lin;Liran Ma;Jianwei Liu","doi":"10.26599/BDMA.2022.9020034","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020034","url":null,"abstract":"Typical Internet of Things (IoT) systems are event-driven platforms, in which smart sensing devices sense or subscribe to events (device state changes), and react according to the preconfigured trigger-action logic, as known as, automation rules. “Events” are essential elements to perform automatic control in an IoT system. However, events are not always trustworthy. Sensing fake event notifications injected by attackers (called event spoofing attack) can trigger sensitive actions through automation rules without involving authorized users. Existing solutions verify events via “event fingerprints” extracted by surrounding sensors. However, if a system has homogeneous sensors that have strong correlations among them, traditional threshold-based methods may cause information redundancy and noise amplification, consequently, decreasing the checking accuracy. Aiming at this, in this paper, we propose “EScope”, an effective event validation approach to check the authenticity of system events based on device state correlation. EScope selects informative and representative sensors using an Neural-Network-based (NN-based) sensor selection component and extracts a verification sensor set for event validation. We evaluate our approach using an existing dataset provided by Peeves. The experiment results demonstrate that EScope achieves an average 67% sensor amount reduction on 22 events compared with the existing work, and increases the event spoofing detection accuracy.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"218-233"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026512.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020021
Xuehong Wu; Junwen Duan; Yi Pan; Min Li
Medical knowledge graphs (MKGs) are the basis of intelligent health care and are already used in a variety of intelligent medical applications. Understanding the research and application development of MKGs will therefore be crucial for future research in the biomedical field. To this end, we offer an in-depth review of MKGs in this work. Our review begins with an examination of four types of medical information sources, knowledge graph construction methodologies, and six major themes of MKG development. Furthermore, three popular reasoning models are discussed from the viewpoint of knowledge reasoning, and a reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKGs. In addition, we explore intelligent medical applications based on RIP and MKGs and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications, along with future challenges and opportunities.
{"title":"Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications","authors":"Xuehong Wu;Junwen Duan;Yi Pan;Min Li","doi":"10.26599/BDMA.2022.9020021","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020021","url":null,"abstract":"Medical knowledge graphs (MKGs) are the basis for intelligent health care, and they have been in use in a variety of intelligent medical applications. Thus, understanding the research and application development of MKGs will be crucial for future relevant research in the biomedical field. To this end, we offer an in-depth review of MKG in this work. Our research begins with the examination of four types of medical information sources, knowledge graph creation methodologies, and six major themes for MKG development. Furthermore, three popular models of reasoning from the viewpoint of knowledge reasoning are discussed. A reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKG. In addition, we explore intelligent medical applications based on RIP and MKG and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications and future challenges and opportunities.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"201-217"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026520.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficacy of Bluetooth-Based Data Collection for Road Traffic Analysis and Visualization Using Big Data Analytics
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020039
Ashish Rajeshwar Kulkarni; Narendra Kumar; K. Ramachandra Rao
Effective management of daily road traffic is a huge challenge for traffic personnel. Urban traffic management has come a long way from manual control to artificial intelligence techniques, yet real-time adaptive traffic control remains an unfulfilled dream for lack of a low-cost, easy-to-install traffic sensor with real-time communication capability. With the growing number of on-board Bluetooth devices in new-generation automobiles, these devices can act as sensors that convey traffic information indirectly. This paper examines the efficacy of road-side Bluetooth scanners for traffic data collection, applying big-data analytics to extract traffic parameters from the collected data. The extracted information and analysis are presented through visualizations and tables; all data analytics and visualizations are carried out offline in the RStudio environment. The reliability of the collected and processed data is also investigated. Data analysis establishes a higher traffic speed in one direction owing to the geometry of the road, and the device types collected confirm the increasing day-to-day penetration of smartphones and fitness bands. The results of this work can support regular data collection in place of the traditional road surveys carried out annually or biannually. Compared with previous studies published in the literature, the device penetration rate and sample size found in this study are quite high and very encouraging. This novel work would be quite useful for effective road traffic management in the future.
{"title":"Efficacy of Bluetooth-Based Data Collection for Road Traffic Analysis and Visualization Using Big Data Analytics","authors":"Ashish Rajeshwar Kulkarni;Narendra Kumar;K. Ramachandra Rao","doi":"10.26599/BDMA.2022.9020039","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020039","url":null,"abstract":"Effective management of daily road traffic is a huge challenge for traffic personnel. Urban traffic management has come a long way from manual control to artificial intelligence techniques. Still real-time adaptive traffic control is an unfulfilled dream due to lack of low cost and easy to install traffic sensor with real-time communication capability. With increasing number of on-board Bluetooth devices in new generation automobiles, these devices can act as sensors to convey the traffic information indirectly. This paper presents the efficacy of road-side Bluetooth scanners for traffic data collection and big-data analytics to process the collected data to extract traffic parameters. Extracted information and analysis are presented through visualizations and tables. All data analytics and visualizations are carried out off-line in R Studio environment. Reliability aspects of the collected and processed data are also investigated. Higher speed of traffic in one direction owing to the geometry of the road is also established through data analysis. Increased penetration of smart phones and fitness bands in day to day use is also established through the device type of the data collected. The results of this work can be used for regular data collection compared to the traditional road surveys carried out annually or bi-annually. It is also found that compared to previous studies published in the literature, the device penetration rate and sample size found in this study are quite high and very encouraging. This is a novel work in literature, which would be quite useful for effective road traffic management in future.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"139-153"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026507.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020015
Mohammed G. Albayati; Jalal Faraj; Amy Thompson; Prathamesh Patil; Ravi Gorthala; Sanguthevar Rajasekaran
Most heating, ventilation, and air-conditioning (HVAC) systems operate with one or more faults that increase energy consumption and can lead to system failure over time. Today, most building owners perform only reactive maintenance and may be less concerned about, or less able to assess, the health of the system until a catastrophic failure occurs, mainly because they have not had good tools to detect and diagnose these faults, determine their impact, and act on the findings. Commercially available fault detection and diagnostics (FDD) tools have been developed to address this issue and have the potential to reduce equipment downtime, energy costs, and maintenance costs, as well as to improve occupant comfort and system reliability. However, many of these tools require in-depth knowledge of system behavior and thermodynamic principles to interpret the results. In this paper, supervised and semi-supervised machine learning (ML) approaches are applied to datasets collected from a system operating in the field to develop new FDD methods and to help building owners see the value proposition of performing proactive maintenance. The study data were collected from one packaged rooftop unit (RTU) HVAC system running under normal operating conditions at an industrial facility in Connecticut. The paper compares three approaches to fault classification for a real-time operating RTU using semi-supervised learning, achieving accuracies as high as 95.7% with few-shot learning.
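A minimal sketch of the semi-supervised pattern follows, using scikit-learn's self-training wrapper: fit on a few labeled fault examples and let pseudo-labeling cover the rest. The features, labels, and base model are illustrative assumptions, not the study's RTU data or exact method.

```python
# Sketch only: synthetic features/labels stand in for real RTU sensor data, and
# self-training with an SVM stands in for the paper's exact semi-supervised models.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))                  # e.g., temperatures, pressures, power
y_true = (X[:, 0] + X[:, 2] > 0).astype(int)   # toy fault / no-fault ground truth

# Keep labels for only a handful of samples; mark the rest unlabeled with -1.
y = np.full(300, -1)
labeled = rng.choice(300, size=20, replace=False)
y[labeled] = y_true[labeled]

# Self-training: fit on the labeled few, then iteratively pseudo-label the rest.
model = SelfTrainingClassifier(SVC(probability=True))
model.fit(X, y)
print("accuracy on all samples:", round((model.predict(X) == y_true).mean(), 3))
```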
{"title":"Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit","authors":"Mohammed G. Albayati;Jalal Faraj;Amy Thompson;Prathamesh Patil;Ravi Gorthala;Sanguthevar Rajasekaran","doi":"10.26599/BDMA.2022.9020015","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020015","url":null,"abstract":"Most heating, ventilation, and air-conditioning (HVAC) systems operate with one or more faults that result in increased energy consumption and that could lead to system failure over time. Today, most building owners are performing reactive maintenance only and may be less concerned or less able to assess the health of the system until catastrophic failure occurs. This is mainly because the building owners do not previously have good tools to detect and diagnose these faults, determine their impact, and act on findings. Commercially available fault detection and diagnostics (FDD) tools have been developed to address this issue and have the potential to reduce equipment downtime, energy costs, maintenance costs, and improve occupant comfort and system reliability. However, many of these tools require an in-depth knowledge of system behavior and thermodynamic principles to interpret the results. In this paper, supervised and semi-supervised machine learning (ML) approaches are applied to datasets collected from an operating system in the field to develop new FDD methods and to help building owners see the value proposition of performing proactive maintenance. The study data was collected from one packaged rooftop unit (RTU) HVAC system running under normal operating conditions at an industrial facility in Connecticut. This paper compares three different approaches for fault classification for a real-time operating RTU using semi-supervised learning, achieving accuracies as high as 95.7% using few-shot learning.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"170-184"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026516.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020019
Jiancheng Zhong; Zuohang Qu; Ying Zhong; Chao Tang; Yi Pan
Essential proteins play a vital role in biological processes, and combining gene expression profiles with Protein-Protein Interaction (PPI) networks can improve their identification. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC), which combines continuous and discrete similarities in the gene expression data. Using graph theory as the basis, we fused the newly proposed similarity coefficient with existing network topology prediction algorithms at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated PJC on the Krogan, Gavin, and DIP PPI datasets of yeast and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with node-based network topology centrality and fused biological information centrality methods, PJC showed significantly improved prediction performance for essential proteins when combined with DC, IC, eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared PJC with other methods using the NF-PIN algorithm, which predicts proteins by constructing active PPI networks through dynamic gene expression. The experimental results proved that the newly proposed similarity coefficient PJC has clear advantages in predicting essential proteins.
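A minimal sketch of the idea behind PJC follows: blend a continuous similarity (Pearson correlation) with a discrete one (Jaccard on discretized profiles). The above-mean discretization rule and the equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: the above-mean discretization and the 50/50 blend are assumptions
# illustrating how a continuous and a discrete similarity can be combined.
import numpy as np
from scipy.stats import pearsonr

def pjc_style(expr_a, expr_b):
    """Blend Pearson correlation with Jaccard similarity of discretized profiles."""
    continuous = pearsonr(expr_a, expr_b)[0]   # continuous similarity
    a = expr_a > expr_a.mean()                 # discretize: gene active / inactive
    b = expr_b > expr_b.mean()
    union = np.logical_or(a, b).sum()
    discrete = np.logical_and(a, b).sum() / union if union else 0.0
    return 0.5 * continuous + 0.5 * discrete   # assumed equal weighting

rng = np.random.default_rng(1)
gene_x = rng.random(36)                           # expression over 36 time points
gene_y = gene_x + rng.normal(scale=0.1, size=36)  # a strongly co-expressed gene
print(f"PJC-style similarity: {pjc_style(gene_x, gene_y):.3f}")
```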
{"title":"Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data","authors":"Jiancheng Zhong;Zuohang Qu;Ying Zhong;Chao Tang;Yi Pan","doi":"10.26599/BDMA.2022.9020019","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020019","url":null,"abstract":"Essential proteins play a vital role in biological processes, and the combination of gene expression profiles with Protein-Protein Interaction (PPI) networks can improve the identification of essential proteins. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC) that consisted of continuous and discrete similarities in the gene expression data. Using the graph theory as the basis, we fused the newly proposed similarity coefficient with the existing network topology prediction algorithm at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated the new similarity coefficient PJC on PPI datasets of Krogan, Gavin, and DIP of yeast species and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with that of node-based network topology centrality and fusion biological information centrality methods, the new similarity coefficient PJC showed a significantly improved prediction performance for essential proteins in DC, IC, Eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared the PJC coefficient with other methods using the NF-PIN algorithm, which predicts proteins by constructing active PPI networks through dynamic gene expression. The experimental results proved that our newly proposed similarity coefficient PJC has superior advantages in predicting essential proteins.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"185-200"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026519.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Denoising Graph Inference Network for Document-Level Relation Extraction
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020051
Hailin Wang; Ke Qin; Guiduo Duan; Guangchun Luo
Relation Extraction (RE) aims to obtain the predefined relation type between two entities mentioned in a piece of text, e.g., a sentence-level or a document-level text. Most existing studies suffer from noise in the text, so necessary pruning is of great importance. The conventional sentence-level RE task addresses this issue with a denoising method that uses the shortest dependency path to build a long-range semantic dependency between entity pairs, but such denoising methods are scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing syntax and discourse dependency relations. A Steiner tree algorithm then extracts a mention-level denoised graph, the Steiner Graph (SG), by removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence in the text and the SG. Finally, classification is established on the SG to infer the relations of entity pairs. Extensive experiments on three public datasets show that our method is beneficial for establishing long-range semantic dependencies and improves classification performance on longer texts.
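The pruning step can be shown on a toy example: build a word graph from dependency edges and keep only the (approximate) Steiner tree that connects the entity mentions. The sentence, edges, and mentions below are invented, and networkx's approximation routine stands in for the paper's exact algorithm.

```python
# Sketch only: a toy dependency graph; networkx's approximate Steiner tree plays
# the role of the paper's SG extraction (the real SDT-forest also adds discourse edges).
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Toy (head, dependent) edges for:
# "Marie Curie, who studied radioactivity, won the Nobel Prize"
edges = [
    ("won", "Curie"), ("Curie", "Marie"), ("Curie", "studied"),
    ("studied", "who"), ("studied", "radioactivity"),
    ("won", "Prize"), ("Prize", "the"), ("Prize", "Nobel"),
]
G = nx.Graph(edges)  # undirected word graph over the dependency tree

# Entity mentions are the terminals the denoised graph must connect.
mentions = ["Curie", "radioactivity", "Prize"]
SG = steiner_tree(G, mentions)
print(sorted(SG.nodes))  # words like "the", "who", "Nobel" are pruned away
```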
{"title":"Denoising Graph Inference Network for Document-Level Relation Extraction","authors":"Hailin Wang;Ke Qin;Guiduo Duan;Guangchun Luo","doi":"10.26599/BDMA.2022.9020051","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020051","url":null,"abstract":"Relation Extraction (RE) is to obtain a predefined relation type of two entities mentioned in a piece of text, e.g., a sentence-level or a document-level text. Most existing studies suffer from the noise in the text, and necessary pruning is of great importance. The conventional sentence-level RE task addresses this issue by a denoising method using the shortest dependency path to build a long-range semantic dependency between entity pairs. However, this kind of denoising method is scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing the syntax and discourse dependency relation. Then, the Steiner tree algorithm extracts a mention-level denoised graph, Steiner Graph (SG), removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence on text and SG. Finally, the classification is established on the SG to infer the relations of entity pairs. We conduct extensive experiments on three public datasets. The results evidence that our method is beneficial to establish long-range semantic dependency and can improve the classification performance with longer texts.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"248-262"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026508.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}