Copy-Move Forgery Verification in Images Using Local Feature Extractors and Optimized Classifiers
Pub Date: 2023-04-07 | DOI: 10.26599/BDMA.2022.9020029
S. B. G. Tilak Babu; Ch Srinivasa Rao
Passive image forgery detection methods, which identify forgeries without prior knowledge, have become a key research focus. In copy-move forgery, an attacker hides a portion of an image by pasting over it another portion of the same image. Detecting such manipulations is in great demand in legal evidence, forensic investigation, and many other fields. This paper presents copy-move forgery detection algorithms built on advanced feature descriptors, such as the local ternary pattern, local phase quantization, the local Gabor binary pattern histogram sequence, the Weber local descriptor, and the local monotonic pattern, paired with classifiers such as an optimized support vector machine (SVM) and an optimized naive Bayes classifier (NBC). The proposed algorithms can efficiently classify an image as either copy-move forged or authentic, even when the test image has been subjected to attacks such as JPEG compression, scaling, rotation, and brightness variation. Images from the CoMoFoD, CASIA, and MICC datasets, as well as a combined CoMoFoD-CASIA set, are used to quantify the performance of the proposed algorithms, which remain more efficient than state-of-the-art algorithms even when the suspected image has been post-processed.
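A minimal sketch of this kind of descriptor-plus-classifier pipeline follows. It uses scikit-image's local binary pattern as a readily available relative of the descriptors named above (LTP, LPQ, and so on) and a grid-searched SVM as the optimized classifier; the random images and labels are hypothetical stand-ins for a dataset such as CoMoFoD, not the paper's actual method.

```python
# Sketch only: LBP histograms stand in for the paper's descriptors (LTP, LPQ, ...),
# and random arrays stand in for a real forgery dataset such as CoMoFoD.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def lbp_histogram(gray_image, points=8, radius=1):
    """Uniform-LBP histogram used as a global texture signature of one image."""
    codes = local_binary_pattern(gray_image, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns plus one catch-all bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

rng = np.random.default_rng(0)
images = [(rng.random((64, 64)) * 255).astype("uint8") for _ in range(40)]
labels = rng.integers(0, 2, size=40)  # 1 = copy-move forged, 0 = authentic

X = np.stack([lbp_histogram(img) for img in images])

# "Optimized SVM": a small grid search over kernel hyperparameters.
search = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 3))
```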
{"title":"Copy-Move Forgery Verification in Images Using Local Feature Extractors and Optimized Classifiers","authors":"S. B. G. Tilak Babu;Ch Srinivasa Rao","doi":"10.26599/BDMA.2022.9020029","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020029","url":null,"abstract":"Passive image forgery detection methods that identify forgeries without prior knowledge have become a key research focus. In copy-move forgery, the assailant intends to hide a portion of an image by pasting other portions of the same image. The detection of such manipulations in images has great demand in legal evidence, forensic investigation, and many other fields. The paper aims to present copy-move forgery detection algorithms with the help of advanced feature descriptors, such as local ternary pattern, local phase quantization, local Gabor binary pattern histogram sequence, Weber local descriptor, and local monotonic pattern, and classifiers such as optimized support vector machine and optimized NBC. The proposed algorithms can classify an image efficiently as either copy-move forged or authenticated, even if the test image is subjected to attacks such as JPEG compression, scaling, rotation, and brightness variation. CoMoFoD, CASIA, and MICC datasets and a combination of CoMoFoD and CASIA datasets images are used to quantify the performance of the proposed algorithms. The proposed algorithms are more efficient than state-of-the-art algorithms even though the suspected image is post-processed.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 3","pages":"347-360"},"PeriodicalIF":13.6,"publicationDate":"2023-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10097649/10097650.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67838277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Security and Privacy in Metaverse: A Comprehensive Survey
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020047
Yan Huang; Yi Joy Li; Zhipeng Cai
The Metaverse describes a new shape of cyberspace and has become a trending term since 2021. There are many explanations of what the Metaverse is and many attempts to provide a formal standard or definition, but these definitions have hardly reached universal acceptance. Rather than offering another formal definition, we list four must-have characteristics of the Metaverse: socialization, immersive interaction, real world-building, and expandability. These characteristics not only carve the Metaverse into a novel and fantastic digital world, but also expose it to a wide range of security and privacy risks, such as personal information leakage, eavesdropping, unauthorized access, phishing, data injection, broken authentication, insecure design, and more. This paper first introduces the four characteristics; it then surveys the current progress and typical applications of the Metaverse, categorizing them into four economic sectors. Based on the four characteristics and the findings on current progress, the security and privacy issues in the Metaverse are investigated. We then identify and discuss further critical security and privacy issues that can arise from combining the four characteristics. Lastly, the paper raises some broader concerns regarding society and humanity.
{"title":"Security and Privacy in Metaverse: A Comprehensive Survey","authors":"Yan Huang;Yi Joy Li;Zhipeng Cai","doi":"10.26599/BDMA.2022.9020047","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020047","url":null,"abstract":"Metaverse describes a new shape of cyberspace and has become a hot-trending word since 2021. There are many explanations about what Meterverse is and attempts to provide a formal standard or definition of Metaverse. However, these definitions could hardly reach universal acceptance. Rather than providing a formal definition of the Metaverse, we list four must-have characteristics of the Metaverse: socialization, immersive interaction, real world-building, and expandability. These characteristics not only carve the Metaverse into a novel and fantastic digital world, but also make it suffer from all security/privacy risks, such as personal information leakage, eavesdropping, unauthorized access, phishing, data injection, broken authentication, insecure design, and more. This paper first introduces the four characteristics, then the current progress and typical applications of the Metaverse are surveyed and categorized into four economic sectors. Based on the four characteristics and the findings of the current progress, the security and privacy issues in the Metaverse are investigated. We then identify and discuss more potential critical security and privacy issues that can be caused by combining the four characteristics. Lastly, the paper also raises some other concerns regarding society and humanity.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"234-247"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026513.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of Distributed Computing Frameworks for Supporting Big Data Analysis
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020014
Xudong Sun; Yulin He; Dingming Wu; Joshua Zhexue Huang
Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support efficient processing of big data on clusters or the cloud. However, the size of big data grows at a pace faster than the growth in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are inadequate for big data analysis tasks, which often require running complex analytical algorithms on extremely large data sets at the terabyte scale. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limits, and a restricted set of analytical algorithms, because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to overcome these challenges. In this paper, we review MapReduce-type distributed computing frameworks currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome these challenges.
{"title":"Survey of Distributed Computing Frameworks for Supporting Big Data Analysis","authors":"Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang","doi":"10.26599/BDMA.2022.9020014","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020014","url":null,"abstract":"Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"154-169"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026506.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud-Based Software Development Lifecycle: A Simplified Algorithm for Cloud Service Provider Evaluation with Metric Analysis
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020016
Santhosh S; Narayana Swamy Ramaiah
At present, hundreds of cloud vendors in the global market provide various services based on customers' requirements. Cloud vendors are not all alike in the number of services offered, infrastructure availability, security strategies, cost per customer, and market reputation. Software developers and organizations therefore face a dilemma when choosing a suitable cloud vendor for their development activities, so various cloud service providers (CSPs) and platforms must be evaluated before a vendor is chosen. Existing solutions either rely on simulation tools configured to the requirements or evaluate vendors against quality-of-service attributes; both require considerable time to collect data, simulate, and evaluate each vendor. The proposed work compares various CSPs on major metrics, such as establishment, services, infrastructure, tools, pricing models, and market share, with a ranking and weightage allocated to each parameter. The parameters are further categorized by priority level. A weighted average is calculated for each CSP, and the resulting values are sorted in descending order. The experimental results show an unbiased selection of CSPs based on the chosen parameters. The proposed parameter-ranking priority level weightage (PRPLW) algorithm simplifies the selection of the cloud vendor best suited to the requirements of software development.
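To make the weighted-average step concrete, here is a minimal sketch, assuming hypothetical vendors, parameter scores, and priority weights; the paper's actual parameters and weightage values are not reproduced here.

```python
# Sketch only: the vendor names, scores, and weights are invented placeholders
# illustrating the rank-weight-sort pattern of a PRPLW-style evaluation.
weights = {"services": 0.30, "infrastructure": 0.25, "pricing": 0.25, "market_share": 0.20}

csp_scores = {
    "VendorA": {"services": 9, "infrastructure": 8, "pricing": 6, "market_share": 9},
    "VendorB": {"services": 7, "infrastructure": 9, "pricing": 8, "market_share": 6},
    "VendorC": {"services": 8, "infrastructure": 7, "pricing": 9, "market_share": 7},
}

def weighted_average(scores, weights):
    """Priority-weighted mean of one CSP's parameter scores (higher is better)."""
    return sum(weights[p] * s for p, s in scores.items()) / sum(weights.values())

# Score every vendor, then sort in descending order of weighted average.
ranking = sorted(
    ((name, weighted_average(s, weights)) for name, s in csp_scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```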
{"title":"Cloud-Based Software Development Lifecycle: A Simplified Algorithm for Cloud Service Provider Evaluation with Metric Analysis","authors":"Santhosh S;Narayana Swamy Ramaiah","doi":"10.26599/BDMA.2022.9020016","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020016","url":null,"abstract":"At present, hundreds of cloud vendors in the global market provide various services based on a customer's requirements. All cloud vendors are not the same in terms of the number of services, infrastructure availability, security strategies, cost per customer, and reputation in the market. Thus, software developers and organizations face a dilemma when choosing a suitable cloud vendor for their developmental activities. Thus, there is a need to evaluate various cloud service providers (CSPs) and platforms before choosing a suitable vendor. Already existing solutions are either based on simulation tools as per the requirements or evaluated concerning the quality of service attributes. However, they require more time to collect data, simulate and evaluate the vendor. The proposed work compares various CSPs in terms of major metrics, such as establishment, services, infrastructure, tools, pricing models, market share, etc., based on the comparison, parameter ranking, and weightage allocated. Furthermore, the parameters are categorized depending on the priority level. The weighted average is calculated for each CSP, after which the values are sorted in descending order. The experimental results show the unbiased selection of CSPs based on the chosen parameters. The proposed parameter-ranking priority level weightage (PRPLW) algorithm simplifies the selection of the best-suited cloud vendor in accordance with the requirements of software development.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"127-138"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026515.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EScope: Effective Event Validation for IoT Systems Based on State Correlation
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020034
Jian Mao; Xiaohe Xu; Qixiao Lin; Liran Ma; Jianwei Liu
Typical Internet of Things (IoT) systems are event-driven platforms in which smart sensing devices sense or subscribe to events (device state changes) and react according to preconfigured trigger-action logic, known as automation rules. Events are essential elements for automatic control in an IoT system. However, events are not always trustworthy: fake event notifications injected by attackers (known as event spoofing attacks) can trigger sensitive actions through automation rules without involving authorized users. Existing solutions verify events via “event fingerprints” extracted from surrounding sensors. However, if a system has homogeneous sensors with strong correlations among them, traditional threshold-based methods may cause information redundancy and noise amplification, consequently decreasing checking accuracy. To address this, we propose EScope, an effective event validation approach that checks the authenticity of system events based on device state correlation. EScope selects informative and representative sensors using a neural-network-based sensor selection component and extracts a verification sensor set for event validation. We evaluate our approach using an existing dataset provided by Peeves. The experimental results demonstrate that EScope achieves an average 67% reduction in the number of sensors across 22 events compared with existing work, while increasing event spoofing detection accuracy.
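The selection step can be illustrated with a deliberately simple stand-in: ranking candidate sensors by mutual information with the event label instead of the paper's neural-network-based component. The synthetic readings below are placeholders for data such as the Peeves dataset.

```python
# Sketch only: mutual information replaces EScope's NN-based selection, and the
# synthetic readings/labels replace real event data such as the Peeves dataset.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # 500 time windows x 12 candidate sensors
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)  # toy event driven by sensors 3 and 7

# Rank sensors by how informative they are about whether the event occurred,
# then keep a small verification set for event validation.
scores = mutual_info_classif(X, y, random_state=0)
top_k = 4
verification_set = np.argsort(scores)[::-1][:top_k]
print("selected sensors:", sorted(verification_set.tolist()))
```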
{"title":"EScope: Effective Event Validation for IoT Systems Based on State Correlation","authors":"Jian Mao;Xiaohe Xu;Qixiao Lin;Liran Ma;Jianwei Liu","doi":"10.26599/BDMA.2022.9020034","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020034","url":null,"abstract":"Typical Internet of Things (IoT) systems are event-driven platforms, in which smart sensing devices sense or subscribe to events (device state changes), and react according to the preconfigured trigger-action logic, as known as, automation rules. “Events” are essential elements to perform automatic control in an IoT system. However, events are not always trustworthy. Sensing fake event notifications injected by attackers (called event spoofing attack) can trigger sensitive actions through automation rules without involving authorized users. Existing solutions verify events via “event fingerprints” extracted by surrounding sensors. However, if a system has homogeneous sensors that have strong correlations among them, traditional threshold-based methods may cause information redundancy and noise amplification, consequently, decreasing the checking accuracy. Aiming at this, in this paper, we propose “EScope”, an effective event validation approach to check the authenticity of system events based on device state correlation. EScope selects informative and representative sensors using an Neural-Network-based (NN-based) sensor selection component and extracts a verification sensor set for event validation. We evaluate our approach using an existing dataset provided by Peeves. The experiment results demonstrate that EScope achieves an average 67% sensor amount reduction on 22 events compared with the existing work, and increases the event spoofing detection accuracy.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"218-233"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026512.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020021
Xuehong Wu; Junwen Duan; Yi Pan; Min Li
Medical knowledge graphs (MKGs) are the basis of intelligent health care and are already used in a variety of intelligent medical applications. Understanding the research and application development of MKGs will therefore be crucial for future research in the biomedical field. To this end, we offer an in-depth review of MKGs in this work. Our review begins with an examination of four types of medical information sources, knowledge graph construction methodologies, and six major themes of MKG development. Furthermore, three popular reasoning models are discussed from the viewpoint of knowledge reasoning, and a reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKGs. In addition, we explore intelligent medical applications based on RIP and MKGs and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications, along with future challenges and opportunities.
{"title":"Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications","authors":"Xuehong Wu;Junwen Duan;Yi Pan;Min Li","doi":"10.26599/BDMA.2022.9020021","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020021","url":null,"abstract":"Medical knowledge graphs (MKGs) are the basis for intelligent health care, and they have been in use in a variety of intelligent medical applications. Thus, understanding the research and application development of MKGs will be crucial for future relevant research in the biomedical field. To this end, we offer an in-depth review of MKG in this work. Our research begins with the examination of four types of medical information sources, knowledge graph creation methodologies, and six major themes for MKG development. Furthermore, three popular models of reasoning from the viewpoint of knowledge reasoning are discussed. A reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKG. In addition, we explore intelligent medical applications based on RIP and MKG and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications and future challenges and opportunities.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"201-217"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026520.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficacy of Bluetooth-Based Data Collection for Road Traffic Analysis and Visualization Using Big Data Analytics
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020039
Ashish Rajeshwar Kulkarni; Narendra Kumar; K. Ramachandra Rao
Effective management of daily road traffic is a huge challenge for traffic personnel. Urban traffic management has come a long way from manual control to artificial intelligence techniques, yet real-time adaptive traffic control remains an unfulfilled dream for lack of a low-cost, easy-to-install traffic sensor with real-time communication capability. With the growing number of on-board Bluetooth devices in new-generation automobiles, these devices can act as sensors that convey traffic information indirectly. This paper examines the efficacy of road-side Bluetooth scanners for traffic data collection, applying big-data analytics to extract traffic parameters from the collected data. The extracted information and analysis are presented through visualizations and tables; all data analytics and visualizations are carried out offline in the RStudio environment. The reliability of the collected and processed data is also investigated. Data analysis establishes a higher traffic speed in one direction owing to the geometry of the road, and the device types collected confirm the increasing day-to-day penetration of smartphones and fitness bands. The results of this work can support regular data collection in place of the traditional road surveys carried out annually or biannually. Compared with previous studies published in the literature, the device penetration rate and sample size found in this study are quite high and very encouraging. This novel work would be quite useful for effective road traffic management in the future.
{"title":"Efficacy of Bluetooth-Based Data Collection for Road Traffic Analysis and Visualization Using Big Data Analytics","authors":"Ashish Rajeshwar Kulkarni;Narendra Kumar;K. Ramachandra Rao","doi":"10.26599/BDMA.2022.9020039","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020039","url":null,"abstract":"Effective management of daily road traffic is a huge challenge for traffic personnel. Urban traffic management has come a long way from manual control to artificial intelligence techniques. Still real-time adaptive traffic control is an unfulfilled dream due to lack of low cost and easy to install traffic sensor with real-time communication capability. With increasing number of on-board Bluetooth devices in new generation automobiles, these devices can act as sensors to convey the traffic information indirectly. This paper presents the efficacy of road-side Bluetooth scanners for traffic data collection and big-data analytics to process the collected data to extract traffic parameters. Extracted information and analysis are presented through visualizations and tables. All data analytics and visualizations are carried out off-line in R Studio environment. Reliability aspects of the collected and processed data are also investigated. Higher speed of traffic in one direction owing to the geometry of the road is also established through data analysis. Increased penetration of smart phones and fitness bands in day to day use is also established through the device type of the data collected. The results of this work can be used for regular data collection compared to the traditional road surveys carried out annually or bi-annually. It is also found that compared to previous studies published in the literature, the device penetration rate and sample size found in this study are quite high and very encouraging. This is a novel work in literature, which would be quite useful for effective road traffic management in future.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"139-153"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026507.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020015
Mohammed G. Albayati; Jalal Faraj; Amy Thompson; Prathamesh Patil; Ravi Gorthala; Sanguthevar Rajasekaran
Most heating, ventilation, and air-conditioning (HVAC) systems operate with one or more faults that increase energy consumption and can lead to system failure over time. Today, most building owners perform only reactive maintenance and may be less concerned about, or less able to assess, the health of the system until a catastrophic failure occurs, mainly because they have not had good tools to detect and diagnose these faults, determine their impact, and act on the findings. Commercially available fault detection and diagnostics (FDD) tools have been developed to address this issue and have the potential to reduce equipment downtime, energy costs, and maintenance costs, as well as to improve occupant comfort and system reliability. However, many of these tools require in-depth knowledge of system behavior and thermodynamic principles to interpret the results. In this paper, supervised and semi-supervised machine learning (ML) approaches are applied to datasets collected from a system operating in the field to develop new FDD methods and to help building owners see the value proposition of performing proactive maintenance. The study data were collected from one packaged rooftop unit (RTU) HVAC system running under normal operating conditions at an industrial facility in Connecticut. The paper compares three approaches to fault classification for a real-time operating RTU using semi-supervised learning, achieving accuracies as high as 95.7% with few-shot learning.
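A minimal sketch of the semi-supervised pattern follows, using scikit-learn's self-training wrapper: fit on a few labeled fault examples and let pseudo-labeling cover the rest. The features, labels, and base model are illustrative assumptions, not the study's RTU data or exact method.

```python
# Sketch only: synthetic features/labels stand in for real RTU sensor data, and
# self-training with an SVM stands in for the paper's exact semi-supervised models.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))                  # e.g., temperatures, pressures, power
y_true = (X[:, 0] + X[:, 2] > 0).astype(int)   # toy fault / no-fault ground truth

# Keep labels for only a handful of samples; mark the rest unlabeled with -1.
y = np.full(300, -1)
labeled = rng.choice(300, size=20, replace=False)
y[labeled] = y_true[labeled]

# Self-training: fit on the labeled few, then iteratively pseudo-label the rest.
model = SelfTrainingClassifier(SVC(probability=True))
model.fit(X, y)
print("accuracy on all samples:", round((model.predict(X) == y_true).mean(), 3))
```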
{"title":"Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit","authors":"Mohammed G. Albayati;Jalal Faraj;Amy Thompson;Prathamesh Patil;Ravi Gorthala;Sanguthevar Rajasekaran","doi":"10.26599/BDMA.2022.9020015","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020015","url":null,"abstract":"Most heating, ventilation, and air-conditioning (HVAC) systems operate with one or more faults that result in increased energy consumption and that could lead to system failure over time. Today, most building owners are performing reactive maintenance only and may be less concerned or less able to assess the health of the system until catastrophic failure occurs. This is mainly because the building owners do not previously have good tools to detect and diagnose these faults, determine their impact, and act on findings. Commercially available fault detection and diagnostics (FDD) tools have been developed to address this issue and have the potential to reduce equipment downtime, energy costs, maintenance costs, and improve occupant comfort and system reliability. However, many of these tools require an in-depth knowledge of system behavior and thermodynamic principles to interpret the results. In this paper, supervised and semi-supervised machine learning (ML) approaches are applied to datasets collected from an operating system in the field to develop new FDD methods and to help building owners see the value proposition of performing proactive maintenance. The study data was collected from one packaged rooftop unit (RTU) HVAC system running under normal operating conditions at an industrial facility in Connecticut. This paper compares three different approaches for fault classification for a real-time operating RTU using semi-supervised learning, achieving accuracies as high as 95.7% using few-shot learning.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"170-184"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026516.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020019
Jiancheng Zhong; Zuohang Qu; Ying Zhong; Chao Tang; Yi Pan
Essential proteins play a vital role in biological processes, and combining gene expression profiles with Protein-Protein Interaction (PPI) networks can improve their identification. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC), which combines continuous and discrete similarities in the gene expression data. Using graph theory as the basis, we fused the newly proposed similarity coefficient with existing network topology prediction algorithms at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated PJC on the Krogan, Gavin, and DIP PPI datasets of yeast and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with node-based network topology centrality and fused biological information centrality methods, PJC showed significantly improved prediction performance for essential proteins when combined with DC, IC, eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared PJC with other methods using the NF-PIN algorithm, which predicts proteins by constructing active PPI networks through dynamic gene expression. The experimental results proved that the newly proposed similarity coefficient PJC has clear advantages in predicting essential proteins.
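A minimal sketch of the idea behind PJC follows: blend a continuous similarity (Pearson correlation) with a discrete one (Jaccard on discretized profiles). The above-mean discretization rule and the equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: the above-mean discretization and the 50/50 blend are assumptions
# illustrating how a continuous and a discrete similarity can be combined.
import numpy as np
from scipy.stats import pearsonr

def pjc_style(expr_a, expr_b):
    """Blend Pearson correlation with Jaccard similarity of discretized profiles."""
    continuous = pearsonr(expr_a, expr_b)[0]   # continuous similarity
    a = expr_a > expr_a.mean()                 # discretize: gene active / inactive
    b = expr_b > expr_b.mean()
    union = np.logical_or(a, b).sum()
    discrete = np.logical_and(a, b).sum() / union if union else 0.0
    return 0.5 * continuous + 0.5 * discrete   # assumed equal weighting

rng = np.random.default_rng(1)
gene_x = rng.random(36)                           # expression over 36 time points
gene_y = gene_x + rng.normal(scale=0.1, size=36)  # a strongly co-expressed gene
print(f"PJC-style similarity: {pjc_style(gene_x, gene_y):.3f}")
```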
{"title":"Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data","authors":"Jiancheng Zhong;Zuohang Qu;Ying Zhong;Chao Tang;Yi Pan","doi":"10.26599/BDMA.2022.9020019","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020019","url":null,"abstract":"Essential proteins play a vital role in biological processes, and the combination of gene expression profiles with Protein-Protein Interaction (PPI) networks can improve the identification of essential proteins. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC) that consisted of continuous and discrete similarities in the gene expression data. Using the graph theory as the basis, we fused the newly proposed similarity coefficient with the existing network topology prediction algorithm at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated the new similarity coefficient PJC on PPI datasets of Krogan, Gavin, and DIP of yeast species and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with that of node-based network topology centrality and fusion biological information centrality methods, the new similarity coefficient PJC showed a significantly improved prediction performance for essential proteins in DC, IC, Eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared the PJC coefficient with other methods using the NF-PIN algorithm, which predicts proteins by constructing active PPI networks through dynamic gene expression. The experimental results proved that our newly proposed similarity coefficient PJC has superior advantages in predicting essential proteins.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"185-200"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026519.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Denoising Graph Inference Network for Document-Level Relation Extraction
Pub Date: 2023-01-26 | DOI: 10.26599/BDMA.2022.9020051
Hailin Wang; Ke Qin; Guiduo Duan; Guangchun Luo
Relation Extraction (RE) aims to obtain the predefined relation type between two entities mentioned in a piece of text, e.g., a sentence-level or a document-level text. Most existing studies suffer from noise in the text, so necessary pruning is of great importance. The conventional sentence-level RE task addresses this issue with a denoising method that uses the shortest dependency path to build a long-range semantic dependency between entity pairs, but such denoising methods are scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing syntax and discourse dependency relations. A Steiner tree algorithm then extracts a mention-level denoised graph, the Steiner Graph (SG), by removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence in the text and the SG. Finally, classification is established on the SG to infer the relations of entity pairs. Extensive experiments on three public datasets show that our method is beneficial for establishing long-range semantic dependencies and improves classification performance on longer texts.
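The pruning step can be shown on a toy example: build a word graph from dependency edges and keep only the (approximate) Steiner tree that connects the entity mentions. The sentence, edges, and mentions below are invented, and networkx's approximation routine stands in for the paper's exact algorithm.

```python
# Sketch only: a toy dependency graph; networkx's approximate Steiner tree plays
# the role of the paper's SG extraction (the real SDT-forest also adds discourse edges).
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Toy (head, dependent) edges for:
# "Marie Curie, who studied radioactivity, won the Nobel Prize"
edges = [
    ("won", "Curie"), ("Curie", "Marie"), ("Curie", "studied"),
    ("studied", "who"), ("studied", "radioactivity"),
    ("won", "Prize"), ("Prize", "the"), ("Prize", "Nobel"),
]
G = nx.Graph(edges)  # undirected word graph over the dependency tree

# Entity mentions are the terminals the denoised graph must connect.
mentions = ["Curie", "radioactivity", "Prize"]
SG = steiner_tree(G, mentions)
print(sorted(SG.nodes))  # words like "the", "who", "Nobel" are pruned away
```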
{"title":"Denoising Graph Inference Network for Document-Level Relation Extraction","authors":"Hailin Wang;Ke Qin;Guiduo Duan;Guangchun Luo","doi":"10.26599/BDMA.2022.9020051","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020051","url":null,"abstract":"Relation Extraction (RE) is to obtain a predefined relation type of two entities mentioned in a piece of text, e.g., a sentence-level or a document-level text. Most existing studies suffer from the noise in the text, and necessary pruning is of great importance. The conventional sentence-level RE task addresses this issue by a denoising method using the shortest dependency path to build a long-range semantic dependency between entity pairs. However, this kind of denoising method is scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing the syntax and discourse dependency relation. Then, the Steiner tree algorithm extracts a mention-level denoised graph, Steiner Graph (SG), removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence on text and SG. Finally, the classification is established on the SG to infer the relations of entity pairs. We conduct extensive experiments on three public datasets. The results evidence that our method is beneficial to establish long-range semantic dependency and can improve the classification performance with longer texts.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"248-262"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026508.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}