2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science : IRI 2020 : proceedings : virtual conference, 11-13 August 2020. IEEE International Conference on Information Reuse and Integration (21st : 2...

Predicting PTSD Severity in Veterans from Self-reports for Early Intervention: A Machine Learning Approach

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00036

Priyanka Annapureddy, Md Fitrat Hossain, Thomas Kissane, Wylie Frydrychowicz, Paromita Nitu, Joseph Coelho, Nadiyah Johnson, P. Madiraju, Zeno Franco, Katinka Hooyer, Niharika Jain, M. Flower, Sheikh Iqbal Ahamed

Early intervention for veterans in crisis is a crucial area of study for reducing the psychological and health burdens on this population. Traumatic experiences during military service are associated with drug and alcohol abuse, suicidality, anger, and disrupted work and family relationships. This project used machine learning (ML) models to integrate sociodemographic data, self-reported baseline symptoms, and brief weekly ecological momentary assessment (EMA) surveys of veterans in a community-based 12-week peer support program in order to predict PTSD severity at discharge. The ML predictions place each participant into one of three risk levels based on PCL-5 score: low, medium, or high. The models were evaluated at different time points (weekly intervals) of the program to identify the earliest week at which predictions can guide early intervention and reduce veterans’ engagement in risky behaviors. The best results came from a voting classifier, with an average F-score of 0.69 at week 4.
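The abstract does not say which base models were combined or how the F-score was averaged. As a rough illustration only, the sketch below assumes hard (majority) voting over three hypothetical models and macro averaging over the three risk levels; the labels and predictions are made-up toy data, not results from the paper.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Hard voting: for each participant, pick the label most models agree on."""
    voted = []
    for votes in zip(*predictions_per_model):
        # On a tie, most_common keeps first-seen order, so the earliest model wins.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

LABELS = ["low", "medium", "high"]  # hypothetical names for the PCL-5 risk levels
y_true = ["low", "medium", "high", "high", "low", "medium"]
model_a = ["low", "medium", "high", "medium", "low", "high"]
model_b = ["low", "low", "high", "high", "low", "medium"]
model_c = ["medium", "medium", "high", "high", "low", "medium"]

ensemble = majority_vote([model_a, model_b, model_c])
print(macro_f1(y_true, model_a, LABELS), macro_f1(y_true, ensemble, LABELS))
```

In this toy case each model makes errors on different participants, so the vote corrects all of them. Repeating the evaluation with features available up to week 1, 2, 3 and so on would mirror the weekly comparison the abstract describes.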

Citations: 3

Dynamic image for micro-expression recognition on region-based framework

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00019

T. Le, T. Tran, M. Rege

Facial micro-expressions are involuntary facial expressions of low intensity and short duration that can reveal hidden emotions. Micro-expression analysis has received increasing attention and has advanced considerably in the field of computer vision. However, studying micro-expressions is very challenging and resource-intensive. Most recent works have attempted to improve spontaneous facial micro-expression recognition with sophisticated hand-crafted feature extraction techniques. Deep neural networks have also been adopted for this task. In this paper, we present a compact framework in which a rank pooling concept called the dynamic image is employed as a descriptor to extract informative features from certain regions of interest, and a convolutional neural network (CNN) is applied to the resulting dynamic images to recognize the micro-expressions they contain. Specifically, a facial motion magnification technique is first applied to the input sequences to enhance the magnitude of facial movements in the data. Rank pooling is then used to obtain dynamic images. Only a fixed number of localized facial areas are extracted from the dynamic images, chosen according to observed dominant muscular changes. CNN models are then fit to the final feature representation for the emotion classification task. The framework is simple compared with other approaches, yet the experimental results we obtained throughout the study support its effectiveness. The framework is evaluated on three state-of-the-art databases: CASME II, SMIC, and SAMM.
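The abstract does not state which rank pooling variant is used. One common closed-form shortcut, approximate rank pooling, builds a dynamic image as a weighted sum of frames with weights alpha_t = 2t - T - 1. The sketch below assumes that variant and uses plain nested lists in place of real image arrays; the function name and the toy frames are hypothetical.

```python
def dynamic_image(frames):
    """Collapse a list of equally sized 2-D frames into one weighted sum.

    Assumes approximate rank pooling with alpha_t = 2t - T - 1 (t = 1..T).
    The weights sum to zero, so static pixels cancel and only motion remains.
    """
    T = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - T - 1  # negative for early frames, positive for late ones
        for r in range(height):
            for c in range(width):
                out[r][c] += alpha * frame[r][c]
    return out

# Tiny 1x2 "video": the left pixel brightens over time, the right one is static.
frames = [[[0.0, 5.0]], [[1.0, 5.0]], [[2.0, 5.0]]]
result = dynamic_image(frames)
print(result)  # [[4.0, 0.0]]
```

In a region-based pipeline like the one described, a function of this kind would be applied to cropped facial regions after motion magnification, and the resulting images passed to a CNN.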

Citations: 9

Forecasting Atmospheric Visibility Using Auto Regressive Recurrent Neural Network

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00037

Jahnavi Jonnalagadda, M. Hashemi

Atmospheric visibility affects not only road traffic but also aviation operations. Poor visibility at the destination can reduce airport capacity, leading to ground delays, flight cancellations, flight diversions, and extra operating costs. Hence, timely visibility forecasts are important for safe operation at airports and on highways. Visibility is affected by meteorological variables such as precipitation, temperature, wind speed, humidity, smoke, fog, mist, and particulate matter (PM) concentrations in the atmosphere. This paper forecasts visibility as a univariate weather variable and explores the effect of highly correlated meteorological variables on visibility, using an Auto Regressive Recurrent Neural Network (ARRNN). By adjusting the number of epochs and the regression horizon, i.e. the number of past time steps used in visibility prediction, we show that ARRNN outperforms long short-term memory (LSTM) networks and the vanilla recurrent neural network (vanilla RNN) in terms of the coefficient of determination (R²).
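To make the regression horizon and the R² metric concrete, here is a minimal sketch under stated assumptions: `make_windows` and `r_squared` are hypothetical helper names, the visibility values are invented, and the last-value prediction is only a stand-in for whatever model consumes the windows.

```python
def make_windows(series, horizon):
    """Turn a series into (past `horizon` values -> next value) training pairs."""
    inputs, targets = [], []
    for i in range(len(series) - horizon):
        inputs.append(series[i:i + horizon])
        targets.append(series[i + horizon])
    return inputs, targets

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (undefined if y_true is constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

visibility = [10.0, 9.0, 7.0, 4.0, 6.0, 8.0]  # toy values, e.g. miles
X, y = make_windows(visibility, horizon=3)

# Stand-in predictor: repeat the most recent value in each window.
last_value_preds = [window[-1] for window in X]
print(X, y, r_squared(y, last_value_preds))
```

R² can be negative, as it is here, when predictions are worse than always guessing the mean of the targets. A larger horizon gives the model more context but leaves fewer training pairs.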

Citations: 7

NLP Relational Queries and Its Application

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00064

Andrei Stoica, K. Pu, Heidar Davoudi

Recent advances in natural language processing (NLP) have shown that statistical and neural network-based algorithms are effective at deep understanding of textual data. We demonstrate that the results of NLP analysis on text documents can enrich relational data so that structured queries can derive further value from the text. In this paper, we show how to perform analytics on a scientific research dataset using both relational data and NLP topic modeling. The integrated NLP features, together with classical relational query constructs, allow one to explore the topic structure of the DBLP dataset with flexibility and precision.
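The abstract does not give the schema or query language used. As one possible reading, the sketch below stores hypothetical topic-model output in an ordinary table next to paper metadata and then queries both with plain SQL via Python's `sqlite3`; all table names, column names, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT, venue TEXT, year INTEGER);
-- One row per (paper, topic): the weight a topic model assigned (made-up values).
CREATE TABLE paper_topic (paper_id INTEGER, topic TEXT, weight REAL);
INSERT INTO paper VALUES
  (1, 'Paper A', 'IRI', 2019), (2, 'Paper B', 'IRI', 2020), (3, 'Paper C', 'VLDB', 2020);
INSERT INTO paper_topic VALUES
  (1, 'databases', 0.7), (1, 'nlp', 0.3),
  (2, 'nlp', 0.9), (2, 'databases', 0.1),
  (3, 'databases', 0.8), (3, 'nlp', 0.2);
""")

# Relational constructs (join, group, order) applied to the NLP-derived column.
rows = conn.execute("""
    SELECT p.venue, t.topic, ROUND(AVG(t.weight), 2) AS mean_weight
    FROM paper AS p
    JOIN paper_topic AS t ON t.paper_id = p.id
    GROUP BY p.venue, t.topic
    ORDER BY p.venue, mean_weight DESC
""").fetchall()

for venue, topic, mean_weight in rows:
    print(venue, topic, mean_weight)
```

Once topic weights live in a table, filters such as year ranges or author joins compose with them like any other relational attribute.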

Citations: 1

Global Land Temperature Forecasting Using Long Short-Term Memory Network

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00038

Prashanti Maktala, M. Hashemi

Based on NASA’s 40 years of satellite data, Earth has experienced drastic climatic changes in the form of sea-level rise, increases in oceanic and atmospheric temperatures, depletion of the ozone layer, and decreases in sea ice and snow cover. These observations indicate that the world is getting warmer, which significantly impacts humans and ecological systems. Forecasting global land temperature could help to identify the extent of the devastating consequences for natural habitats and shed light on the impact of policies designed to mitigate them. Previous studies have attempted to forecast regional temperatures using traditional machine learning models. This paper uses a standard multi-layer perceptron, a simple Recurrent Neural Network, and a Long Short-Term Memory network to forecast next month’s global land temperature. Our results show that deep learning outperforms traditional machine learning models, including decision tree, random forest, and ridge regression.
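The abstract does not describe the evaluation protocol. The sketch below shows one conventional setup for next-month forecasting, assumed here rather than taken from the paper: hold out the most recent months, and score a naive persistence baseline with RMSE so that learned models have a reference point. The helper names and temperatures are hypothetical.

```python
import math

def chronological_split(series, test_size):
    """Hold out the last `test_size` points (test_size >= 1); no shuffling,
    so the model never sees the future during training."""
    return series[:-test_size], series[-test_size:]

def persistence_forecast(train, test):
    """Naive baseline: predict each month as the previous month's observed value."""
    return train[-1:] + test[:-1]

def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

temps = [8.0, 9.0, 11.0, 14.0, 13.0, 10.0]  # toy monthly averages in degrees Celsius
train, test = chronological_split(temps, test_size=2)
preds = persistence_forecast(train, test)
print(preds, rmse(test, preds))
```

Any of the models named in the abstract could be dropped in place of `persistence_forecast` and compared on the same held-out months.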

Citations: 2

Medicare Fraud Detection using CatBoost

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00022

John T. Hancock, T. Khoshgoftaar

In this study we investigate the performance of CatBoost in the task of identifying Medicare fraud. The Medicare claims data we use as input for CatBoost contain a number of categorical features. Some of these features, such as the procedure code and provider zip code, have thousands of possible values. One contribution of this study is to show how we use CatBoost to eliminate some data pre-processing steps that authors of related works take. A second contribution is to show improvements in CatBoost’s performance, in terms of Area Under the Receiver Operating Characteristic Curve (AUC), when we include another categorical feature (provider state) as input to CatBoost. We show that CatBoost attains better performance than XGBoost in the task of Medicare fraud detection with respect to the AUC metric. At a 99% confidence level (with p-value 0), our experiments show that XGBoost obtains a mean AUC value of 0.7615 while CatBoost obtains a mean AUC value of 0.7851, validating the significance of CatBoost’s performance improvement over XGBoost. Moreover, when we include an additional categorical feature (healthcare provider state) in our data analysis, CatBoost yields a mean AUC value of 0.8902, which is also statistically significant at a 99% confidence level (with p-value 0). Our empirical evidence clearly indicates that CatBoost is a better alternative to XGBoost for Medicare fraud detection, especially when dealing with categorical features.
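Since every comparison in this abstract rests on AUC, a small worked example of the metric may help. The sketch below computes AUC from its pairwise definition, the probability that a randomly chosen fraudulent case receives a higher score than a randomly chosen legitimate one; the labels and scores are toy values, not outputs of CatBoost or XGBoost.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(P * N), which is fine for a small
    illustration but too slow for full claims data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1]               # 1 = fraudulent provider (toy data)
scores = [0.1, 0.4, 0.35, 0.2, 0.8]    # hypothetical model scores
auc = roc_auc(labels, scores)
print(auc)
```

Because AUC depends only on the ranking of scores, it is well suited to heavily imbalanced problems such as fraud detection, where a fixed decision threshold is hard to choose in advance.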

Citations: 26

Message from the Program Co-Chairs - IRI 2020

Pub Date : 2020-08-01 DOI: 10.1109/iri49571.2020.00006

Citations: 0

Latent Feature Modelling for Recommender Systems

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00057

Abdullah Alhejaili, S. Fatima

Matrix factorization is one of the most successful model-based collaborative filtering approaches in recommender systems. Useful latent user features can make recommendations more accurate, but user privacy and cross-domain access restrictions make such information difficult to collect and analyze. In this study, we propose a feature extraction method (WAFE) that leverages user-item interaction history to extract useful latent user features. We also propose a rating prediction approach that incorporates the local mean of users’ and items’ ratings. We evaluate our proposed model on two real-world benchmark datasets and compare its performance against state-of-the-art matrix factorization collaborative filtering methods. Evaluation results show that the proposed method outperforms the existing methods.
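The abstract does not define WAFE or give the exact prediction rule, so the following is not the authors' method. It is a generic sketch of how local means can be combined with a matrix factorization term, assuming the baseline is the average of the user's mean rating and the item's mean rating and that latent factors `P` and `Q` have already been learned; all names and numbers are hypothetical.

```python
def local_means(ratings):
    """ratings: dict mapping (user, item) -> rating. Returns per-user and per-item means."""
    by_user, by_item = {}, {}
    for (user, item), r in ratings.items():
        by_user.setdefault(user, []).append(r)
        by_item.setdefault(item, []).append(r)
    user_mean = {u: sum(v) / len(v) for u, v in by_user.items()}
    item_mean = {i: sum(v) / len(v) for i, v in by_item.items()}
    return user_mean, item_mean

def predict(user, item, user_mean, item_mean, P, Q):
    """Assumed rule: mean-based baseline plus the dot product of latent factors."""
    baseline = (user_mean[user] + item_mean[item]) / 2.0
    interaction = sum(p * q for p, q in zip(P[user], Q[item]))
    return baseline + interaction

ratings = {("u1", "i1"): 5.0, ("u1", "i2"): 3.0, ("u2", "i1"): 4.0}
user_mean, item_mean = local_means(ratings)

# Hand-picked latent factors; a real system would learn them, e.g. by SGD.
P = {"u1": [0.5, 0.0], "u2": [0.0, 0.5]}
Q = {"i1": [1.0, 0.0], "i2": [0.0, 1.0]}

estimate = predict("u2", "i2", user_mean, item_mean, P, Q)
print(estimate)  # 4.0
```

The baseline absorbs the fact that some users rate generously and some items are broadly liked, which leaves the latent factors to model only the user-specific deviation.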

Citations: 2

Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses.

Pub Date : 2020-08-01 Epub Date: 2020-09-10 DOI: 10.1109/iri49571.2020.00044

Maria Diaz, Anand Panangadan

There is increasing interest in automatically identifying advertisements related to sex trafficking on online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small, high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier and then to identify suspicious activity in the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances in a large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.
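The bag-of-words step described above can be sketched in a few lines. The version below assumes whitespace tokenization, length-normalized term frequency, and an unsmoothed idf of log(N / df); the abstract does not specify these details, and libraries differ in their defaults. The sample reviews are invented and `tfidf_matrix` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Return (vocabulary, document-term matrix) with TF-IDF weights.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocab}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, matrix

reviews = ["great massage great staff", "clean place friendly staff", "great place"]
vocab, matrix = tfidf_matrix(reviews)
print(vocab)
print(matrix[0])
```

Each row of `matrix` is one review and each column one term, so the rows can be fed directly to a classifier. Dropping low-importance columns corresponds to the feature reduction the abstract reports.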

Citations: 9

Foreword - IRI 2020

Pub Date : 2020-08-01 DOI: 10.1109/iri49571.2020.00005

Citations: 0