2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science : IRI 2020 : proceedings : virtual conference, 11-13 August 2020. IEEE International Conference on Information Reuse and Integration (21st : 2... Latest articles
Cross-Domain Helpfulness Prediction of Online Consumer Reviews by Deep Learning Model
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00069; pp. 412-418
Shih-Hung Wu, Yi-Kun Chen
Customer reviews provide helpful information such as usage experiences and critiques, making them a critical information resource for future customers. As the volume of online reviews grows, people need a way to find the most helpful ones automatically. Previous studies either predicted the percentage of helpful votes with a regression model or classified reviews as helpful or unhelpful. However, the voting result of an online review is not constant over time, and we also find that many reviews receive zero votes. Therefore, we collect the voting results of the same online customer reviews over time and observe how the votes change in order to find a better learning target. We collected a dataset of online reviews in five product categories (“Apple”, “Video Game”, “Clothing, Shoes & Jewelry”, “Sports & Outdoors”, and “Prime Video”) from Amazon.com, together with the helpfulness votes of the reviews, and monitored the votes for six weeks. Experiments on the dataset establish a reasonable classification of zero-vote and non-zero-vote reviews. We construct a classification system that classifies the online reviews with the deep learning model BERT. The results show that the classifier performs well on helpfulness prediction. We also test the classifier on cross-domain prediction and obtain promising results.
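The zero versus non-zero learning target described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact labeling rule, so taking the last weekly snapshot as the label is an assumption, and the review ids and vote counts below are made up for illustration.

```python
from typing import Dict, List


def label_reviews(snapshots: Dict[str, List[int]]) -> Dict[str, int]:
    """Map review id -> 1 if the last observed helpful-vote count is non-zero, else 0.

    Assumption: the final weekly snapshot defines the label.
    """
    return {rid: int(counts[-1] > 0) for rid, counts in snapshots.items()}


def changed_during_monitoring(snapshots: Dict[str, List[int]]) -> List[str]:
    """Return ids of reviews whose vote count differs between first and last snapshot."""
    return sorted(rid for rid, counts in snapshots.items() if counts[0] != counts[-1])


# Hypothetical data: six weekly helpful-vote counts per review.
weekly_votes = {
    "r1": [0, 0, 0, 0, 0, 0],
    "r2": [0, 1, 3, 3, 4, 4],
    "r3": [2, 2, 2, 2, 2, 2],
}

labels = label_reviews(weekly_votes)
changed = changed_during_monitoring(weekly_votes)
```

Here `r2` would have been labeled zero-vote in week one and non-zero by week six, which is the kind of drift that motivates monitoring votes over time before fixing a target.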
Fully Bayesian Learning of Multivariate Beta Mixture Models
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00025; pp. 120-127
Mahsa Amirkhani, Narges Manouchehri, N. Bouguila
Mixture models have been widely used as statistical learning paradigms in unsupervised machine learning applications, where labeling a vast amount of data is impractical and costly. They have shown significant success and convincing performance in many real-world problems such as medical applications, image clustering, and anomaly detection. In this paper, we explore a fully Bayesian analysis of the multivariate Beta mixture model and propose a solution to the parameter estimation problem using a Markov Chain Monte Carlo technique. We exploit Gibbs sampling within Metropolis-Hastings for Monte Carlo simulation. We also obtain a prior distribution that is conjugate to the multivariate Beta. The performance of the proposed method is evaluated and compared with a Bayesian Gaussian mixture model on challenging applications, including cell image categorization and network intrusion detection. Experimental results confirm that the proposed technique provides an effective solution compared to similar alternatives.
Multi-Class Cardiovascular Diseases Diagnosis from Electrocardiogram Signals using 1-D Convolution Neural Network
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00060; pp. 372-378
Mehdi Fasihi, M. Nadimi-Shahraki, A. Jannesari
The electrocardiogram (ECG) is an important signal in health informatics for the detection of cardiac abnormalities. Several studies have applied machine learning techniques to ECG analysis. However, they need additional computation owing to the challenges of ECG signals. We introduce a new 1-D convolutional neural network (CNN) architecture to diagnose arrhythmia automatically. The proposed architecture consists of four convolution layers, three pooling layers, and three fully connected layers, and is evaluated on the arrhythmia dataset. All previous studies were conducted to distinguish healthy people from people with arrhythmia. In this paper, we go further and propose multiclass classification with two classes of cardiac disease and one class of healthy people. The results are compared with a common 1-D CNN and seven different classifiers. The experimental results demonstrate that the proposed architecture is superior to existing classifiers and competitive with the state of the art in terms of accuracy.
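To make the layer stack in this abstract concrete, the sketch below traces how the signal length shrinks through four convolution layers interleaved with three pooling layers. Only the layer counts come from the abstract; the kernel sizes, strides, layer ordering, and input length of 280 are hypothetical choices for illustration.

```python
from typing import List, Tuple


def out_length(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded 1-D convolution or pooling window."""
    return (length - kernel) // stride + 1


# (layer kind, kernel size, stride); all sizes are assumed, not from the paper.
LAYERS: List[Tuple[str, int, int]] = [
    ("conv", 5, 1), ("pool", 2, 2),
    ("conv", 5, 1), ("pool", 2, 2),
    ("conv", 3, 1), ("pool", 2, 2),
    ("conv", 3, 1),
]


def trace_lengths(input_length: int) -> List[int]:
    """Return the signal length after each layer, in order."""
    lengths = []
    current = input_length
    for _kind, kernel, stride in LAYERS:
        current = out_length(current, kernel, stride)
        lengths.append(current)
    return lengths


lengths = trace_lengths(280)  # hypothetical input length
```

The last length, multiplied by the number of channels in the final convolution layer, gives the size of the flattened vector fed to the first of the three fully connected layers.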
Distributed Differentially Private Mutual Information Ranking and Its Applications
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00021; pp. 90-96
Ankit Srivastava, Samira Pouyanfar, Joshua Allen, Ken Johnston, Qida Ma
Computing Mutual Information (MI) helps quantify the amount of information shared between a pair of random variables. Automated feature selection techniques based on MI ranking are regularly used to extract information from sensitive datasets exceeding petabytes in size, over millions of features and classes. A series of one-vs-all MI computations can be cascaded to produce n-fold MI results, rapidly pinpointing informative relationships. This ability to quickly pinpoint the most informative relationships in datasets covering billions of users creates privacy concerns. In this paper, we present Distributed Differentially Private Mutual Information (DDP-MI), a privacy-safe, fast batch MI computation, across various scenarios such as feature selection, segmentation, ranking, and query expansion. This distributed implementation is protected with global-model differential privacy to provide strong assurances against a wide range of privacy attacks. We also show that DDP-MI can substantially improve the efficiency of MI calculations compared to standard implementations on a large-scale public dataset.
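As a rough illustration of the idea in this abstract, the sketch below computes MI from a table of joint counts and shows one common way to privatize such a computation: adding Laplace noise to the counts before computing MI. The abstract does not describe the DDP-MI mechanism itself, so the noise placement, the `epsilon` parameter, and the helper names are assumptions, and this single-machine version says nothing about the distributed implementation.

```python
import math
import random
from typing import Dict, Tuple

Counts = Dict[Tuple[str, str], float]


def mutual_information(counts: Counts) -> float:
    """MI in bits between X and Y, estimated from joint counts of (x, y) pairs."""
    total = sum(counts.values())
    if total <= 0:
        return 0.0
    px: Dict[str, float] = {}
    py: Dict[str, float] = {}
    for (x, y), c in counts.items():
        px[x] = px.get(x, 0.0) + c
        py[y] = py.get(y, 0.0) + c
    mi = 0.0
    for (x, y), c in counts.items():
        if c > 0:
            # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
            mi += (c / total) * math.log2(c * total / (px[x] * py[y]))
    return mi


def noisy_counts(counts: Counts, epsilon: float, rng: random.Random) -> Counts:
    """Add Laplace(1/epsilon) noise to each cell and clamp at zero (assumed mechanism)."""
    out: Counts = {}
    for key, c in counts.items():
        # The difference of two exponentials with rate epsilon is Laplace with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        out[key] = max(0.0, c + noise)
    return out


dependent = {("a", "0"): 50.0, ("b", "1"): 50.0}
independent = {("a", "0"): 25.0, ("a", "1"): 25.0, ("b", "0"): 25.0, ("b", "1"): 25.0}

mi_dependent = mutual_information(dependent)      # Y is fully determined by X
mi_independent = mutual_information(independent)  # X and Y are unrelated
private = noisy_counts(independent, epsilon=1.0, rng=random.Random(0))
mi_private = mutual_information(private)
```

Ranking features then amounts to sorting them by their (noisy) MI with the target; smaller `epsilon` means more noise and a less reliable ranking.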
IRI 2020 TOC
Pub Date: 2020-08-01; DOI: 10.1109/iri49571.2020.00004
Rashmi Jha, David Kapp, Thuong Khanh Tran
KGdiff: Tracking the Evolution of Knowledge Graphs
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00047; pp. 279-286
Abbas Keshavarzi, K. Kochut
A Knowledge Graph (KG) is a machine-readable, labeled, graph-like representation of human knowledge. Because the main goal of a KG is to represent data by enriching it with computer-processable semantics, knowledge graph creation usually involves acquiring data from external resources and datasets. In many domains, especially in biomedicine, the data sources continuously evolve, and KG engineers and domain experts must not only track the changes in KG entities and their interconnections but also introduce changes to the KG schema and the graph population software. We present a framework to track KG evolution in terms of both the schema and the individuals. KGdiff is a software tool that incrementally collects the relevant metadata from a KG and compares it to a prior version of the KG. The KG is represented in OWL/RDF/RDFS, and the metadata is collected using domain-independent queries. We evaluate our method on different RDF/OWL data sets (ontologies).
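The version comparison described in this abstract can be sketched with plain sets of triples. This is not KGdiff's implementation: the abstract says metadata is gathered with domain-independent queries over OWL/RDF/RDFS, whereas the sketch below uses in-memory tuples, and the prefixed names (`ex:...`) and helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def diff_triples(old: Set[Triple], new: Set[Triple]) -> Dict[str, Set[Triple]]:
    """Triples present in only one of the two versions."""
    return {"added": new - old, "removed": old - new}


def instance_counts(triples: Set[Triple]) -> Counter:
    """Number of individuals per class, read from rdf:type triples."""
    return Counter(o for _s, p, o in triples if p == RDF_TYPE)


def count_changes(old: Set[Triple], new: Set[Triple]) -> Dict[str, int]:
    """Per-class change in instance count between versions (non-zero entries only)."""
    before, after = instance_counts(old), instance_counts(new)
    return {
        cls: after[cls] - before[cls]
        for cls in set(before) | set(after)
        if after[cls] != before[cls]
    }


v1 = {
    ("ex:aspirin", RDF_TYPE, "ex:Drug"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:headache", RDF_TYPE, "ex:Disease"),
}
v2 = {
    ("ex:aspirin", RDF_TYPE, "ex:Drug"),
    ("ex:ibuprofen", RDF_TYPE, "ex:Drug"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:headache", RDF_TYPE, "ex:Disease"),
}

delta = diff_triples(v1, v2)
changes = count_changes(v1, v2)
```

Summaries such as `changes` are a small example of metadata that does not depend on the domain: the same function works for any graph that uses `rdf:type`.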
Data Driven Relational Constraint Programming
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00030; pp. 156-163
Michael Valdron, K. Pu
We propose a data-driven constraint programming environment that merges the power of two separate domains: databases and SAT solvers. While a database system offers flexible data models and query languages, SAT solvers offer the ability to satisfy logical constraints and optimization objectives. In this paper, we describe a goal-oriented declarative algebra that seamlessly integrates both worlds. Borrowing from proven practices in functional programming, we express constants, variables, and constraints in a unified relational query language. The language is implemented on top of industrial-strength database engines and SAT solvers. To support iterative constraint programming, we propose several debugging operators that assist with interactive constraint solving.
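A toy example may help show what combining relational data with constraint solving looks like. The sketch below is not the paper's algebra or query language: the rows stand in for a database relation, each row gets one boolean decision variable, and an exhaustive search stands in for a SAT solver. The table contents and the `solve` helper are hypothetical.

```python
from itertools import product
from typing import Dict, List

# A small relation, as it might be returned by a database query.
courses: List[Dict] = [
    {"name": "db", "slot": 1, "credits": 3},
    {"name": "ai", "slot": 1, "credits": 4},
    {"name": "os", "slot": 2, "credits": 3},
]


def solve(rows: List[Dict], min_credits: int) -> List[List[str]]:
    """Enumerate all row selections that satisfy both constraints.

    Constraints: no two chosen rows share a time slot, and the chosen
    credits sum to at least min_credits. Brute force replaces a SAT solver.
    """
    solutions = []
    for picks in product([False, True], repeat=len(rows)):
        chosen = [row for row, picked in zip(rows, picks) if picked]
        slots = [row["slot"] for row in chosen]
        if len(slots) != len(set(slots)):
            continue  # time-slot clash
        if sum(row["credits"] for row in chosen) < min_credits:
            continue  # not enough credits
        solutions.append(sorted(row["name"] for row in chosen))
    return sorted(solutions)


timetables = solve(courses, min_credits=6)
```

Exhaustive search is exponential in the number of rows, which is why a real system would hand the encoded constraints to a solver instead.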
Combating Hard or Soft Disasters with Privacy-Preserving Federated Mobile Buses-and-Drones based Networks
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00013; pp. 31-36
Bo Ma, Jinsong Wu, William Liu, L. Chiaraviglio, Xing Ming
Mobile edge computing enabled infrastructure is expected to become popular in the incoming fifth generation (5G) and future sixth generation (6G) wireless networks. After a ‘hard’ disaster such as an earthquake or a ‘soft’ disaster such as the COVID-19 pandemic, the existing telecommunication infrastructure, including wired and wireless networks, is often seriously compromised or subject to infectious disease risks and should-not-close-contact restrictions, and thus cannot guarantee regular coverage and reliable communications services. These temporarily missing communications capabilities are crucial to rescuers, health-carers, and affected or infected citizens, because responders need to coordinate and communicate effectively to minimize the loss of lives and property; this is where the 5G/6G mobile edge network helps. Meanwhile, federated machine learning (FML) methods have recently been developed to address the privacy leakage problems of traditional machine learning, which is normally held by one centralized organization and carries the high risk of a single point of hacking. After detailing the current state of the art in privacy preservation, federated learning, and mobile edge communications networks for ‘hard’ and ‘soft’ disasters, we consider the main challenges that need to be faced. We envision a privacy-preserving, federated learning enabled, buses-and-drones based mobile edge infrastructure (ppFL-AidLife) for disaster or pandemic emergency communications. The ppFL-AidLife system aims at a rapidly deployable, resilient network capable of supporting flexible, privacy-preserving, and low-latency communications to serve large-scale disaster situations by utilizing the existing public transport networks, together with drones, to maximally extend radio coverage to hard-to-reach disaster zones or should-not-close-contact pandemic zones.
IRI 2020 Breaker Page
Pub Date: 2020-08-01; DOI: 10.1109/iri49571.2020.00003
Semantic Embeddings for Medical Providers and Fraud Detection
Pub Date: 2020-08-01; DOI: 10.1109/IRI49571.2020.00039; pp. 224-230
Justin M. Johnson, T. Khoshgoftaar
A medical provider’s specialty is a significant predictor for detecting fraudulent providers with machine learning algorithms. When the specialty variable is encoded using a one-hot representation, however, models are subjected to sparse and uninformative feature vectors. We explore three techniques for representing medical provider types with dense, semantic embeddings that capture specialty similarities. The first two methods (GloVe and Med-Word2Vec) use pre-trained word embeddings to convert provider specialty descriptions to short phrase embeddings. Next, we propose a method for constructing semantic provider type embeddings from the procedure-level activity within each specialty group. For each embedding technique, we use Principal Component Analysis to compare the performance of embedding sizes between 32 and 128. Each embedding technique is evaluated on a highly imbalanced Medicare fraud prediction task using Logistic Regression (LR), Random Forest (RF), Gradient Boosted Tree (GBT), and Multilayer Perceptron (MLP) learners. Experiments are repeated 30 times, and confidence intervals show that all three semantic embeddings significantly outperform one-hot representations when using RF and GBT learners. Our contributions include a novel method for embedding medical specialties from procedure codes and a comparison of three semantic embedding techniques for Medicare fraud detection.
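The contrast this abstract draws between one-hot and dense specialty representations can be shown with a small sketch. The three-dimensional word vectors below are made up and merely stand in for pre-trained GloVe or Med-Word2Vec vectors, and averaging word vectors to get a phrase embedding is an assumption, since the abstract does not say how the phrase embeddings are formed.

```python
import math
from typing import Dict, List

# Toy stand-ins for pre-trained word vectors (hypothetical values).
toy_vectors: Dict[str, List[float]] = {
    "cardiology": [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.0],
    "surgery": [0.1, 0.9, 0.0],
    "dermatology": [0.0, 0.1, 0.9],
}


def phrase_embedding(phrase: str) -> List[float]:
    """Average the vectors of the known words in a specialty description."""
    words = [w for w in phrase.lower().split() if w in toy_vectors]
    if not words:
        raise ValueError(f"no known words in {phrase!r}")
    dim = len(toy_vectors[words[0]])
    return [sum(toy_vectors[w][i] for w in words) / len(words) for i in range(dim)]


def one_hot(index: int, size: int) -> List[float]:
    """One-hot vector with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# One-hot codes treat every pair of distinct specialties as equally unrelated.
sim_one_hot = cosine(one_hot(0, 3), one_hot(1, 3))

# Dense embeddings can place related specialties closer together.
sim_related = cosine(phrase_embedding("Cardiology"), phrase_embedding("Cardiac Surgery"))
sim_unrelated = cosine(phrase_embedding("Cardiology"), phrase_embedding("Dermatology"))
```

With real pre-trained vectors the embeddings would have hundreds of dimensions, which is where a reduction step such as PCA to sizes between 32 and 128 comes in.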