Predicting user behavior is an important problem in web mining with commercial implications. User responses to search engine results are crucial for understanding the relative popularity of websites and market trends. The most popular way of understanding user interests is via click models, which predict from past observations whether a user will click on a search engine result. There are two main categories of click models: neural network based models and probabilistic graphical models. In this paper, we combine the strengths of both approaches by presenting a weighted ensemble of both types of models. A weighted sum of softmax scores integrates the predictions of the individual models. Assigning higher weights to the neural models is found to improve the performance of the ensemble. The AUC and perplexity scores of our weighted ensemble model improve on the state of the art, as shown by experiments on the benchmark Tiangong-ST dataset.
{"title":"Weighted Ensemble of Neural and Probabilistic Graphical Models for Click Prediction","authors":"Kritarth Bisht, Seba Susan","doi":"10.1145/3471287.3471307","DOIUrl":"https://doi.org/10.1145/3471287.3471307","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
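The weighted sum of softmax scores described in the abstract above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names, the two-class `[no-click, click]` layout, the example scores and the 0.7/0.3 weights are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(model_scores, weights):
    """Combine per-model class scores via a weighted sum of softmax outputs.

    model_scores: one list of raw [no-click, click] scores per model.
    weights: one non-negative weight per model (normalised here to sum to 1).
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_w = sum(weights)
    probs = [softmax(s) for s in model_scores]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_w
        for c in range(n_classes)
    ]


# Hypothetical raw scores for one search result: a neural click model and a
# probabilistic graphical model, with the larger weight on the neural model.
neural_scores = [0.2, 1.4]
pgm_scores = [0.9, 0.6]
combined = weighted_ensemble([neural_scores, pgm_scores], weights=[0.7, 0.3])
```

Because each softmax output sums to 1 and the weights are normalised, `combined` is again a probability distribution over the two classes.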
Karim Roca, Belinda Navarro, Hector Carlos, Edwin Delgado, M. Ore
The objective of this work is to determine the degree, direction and significance of the relationship between transformational leadership and organizational climate among teachers in publicly managed educational institutions. The stratified random sample consisted of 120 teachers. The research had a quantitative approach, a correlational type, and a cross-sectional design. The information was collected with the Transformational Leadership Scale and the Organizational Climate Scale. The content validity and reliability of the instruments were corroborated according to the standards of the scientific community with Aiken's validity coefficient, Cronbach's alpha coefficient and the Kuder-Richardson coefficient (KR-20), respectively. The statistical analysis of the data was done with the Estanones scale for the description of the qualitative levels of the variables, and the parametric Pearson's correlation coefficient (r) for the hypothesis test. The results showed direct correlations of moderate intensity, while the dimension of inspirational communication shows a low direct correlation with the organizational climate. Finally, the findings turned out to be statistically significant at a probability level of 0.05.
{"title":"Characterization of the organizational climate in public schools from the teacher's perception using the Estanones scale","authors":"Karim Roca, Belinda Navarro, Hector Carlos, Edwin Delgado, M. Ore","doi":"10.1145/3471287.3471310","DOIUrl":"https://doi.org/10.1145/3471287.3471310","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
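The two statistics named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the study's analysis code: the cut-off constant 0.75 is the commonly cited form of the Estanones scale and is an assumption here (the abstract does not state it), and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient r between two equal-length samples.

    Assumes both samples have non-zero variance.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("samples must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def estanones_cutoffs(scores, k=0.75):
    """Cut points mean - k*SD and mean + k*SD (k = 0.75 is an assumption)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    return mean - k * sd, mean + k * sd


def level(score, low, high):
    """Map a raw score to a qualitative level using the two cut points."""
    if score < low:
        return "low"
    if score > high:
        return "high"
    return "medium"
```

A direct (positive) correlation corresponds to `r > 0`; whether it counts as low or moderate depends on the thresholds chosen by the analyst, which the abstract does not specify.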
Chaity Banerjee, Chad Lilian, D. Reasor, E. Pasiliao, Tathagata Mukherjee
In this paper, we propose a robust learning pipeline for inference in computational fluid dynamics (CFD) systems in the presence of faulty sensor data. The standard methods for handling faulty sensor data involve outlier detection techniques, which assume that the faulty data is generated from the tail regions of the underlying data distribution and hence can be eliminated by modeling the high-probability regions of the distribution. However, this assumption is not always true: subtle faults in sensors can lead to the recording of faulty data that can be thought of as being generated from a subtly perturbed version of the underlying distribution. Methods based on outlier detection will fail under these settings, and hence novel approaches are required for eliminating faulty data in such systems. In this work, we explore the use of a Generative Adversarial Network (GAN) for this purpose. We train the generator network of the GAN to generate “fake” sensor data that mimics the distribution of the real data, albeit a slightly perturbed one. We use this to train a discriminator network, which learns to distinguish between the “real” data and the “fake” data produced by the generator. This discriminator is then used to filter out faulty sensor data generated from a perturbed version of the distribution generating the real data. We also build a simple regressor that uses the trained discriminator to perform robust regression on the CFD data after eliminating faulty sensor data. We tested the robust regression pipeline with CFD data for predicting fluid flow characteristics (specifically the angle of attack (AoA)) over a 2D foil. Our discriminator trained in a GAN framework could eliminate faulty sensor data, generated using the trained generator, with ~100% efficiency. The filtered data is then used for inference of the fluid flow parameters using the regressor.
{"title":"An Application of Generative Adversarial Networks for Robust Inference in Computational Fluid Dynamics","authors":"Chaity Banerjee, Chad Lilian, D. Reasor, E. Pasiliao, Tathagata Mukherjee","doi":"10.1145/3471287.3471304","DOIUrl":"https://doi.org/10.1145/3471287.3471304","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
In recent years, ransomware attacks have become increasingly rampant, and many large companies and financial institutions have suffered heavy losses from them. Bitcoin is the means of payment demanded by ransomware families. By comparing and analyzing the characteristics of bitcoin transactions, we can predict the ransomware family involved. In this paper, machine learning algorithms are therefore used to build a prediction method for ransomware families, with the aim of helping attacked institutions avoid extortion. In the traditional approach, identifying a ransomware family relies on human experience and subjective judgment rather than on accurate, batch analysis of bitcoin transactions. In this paper, a large set of known bitcoin transaction features is used for analysis and modeling. First, we carried out descriptive statistical analysis to explore the differences between ransomware families in bitcoin trading behavior. Next, we used a series of machine learning models to build a prediction model that identifies and classifies ransomware families, so as to help avoid financial losses from ransomware. Finally, we found that the ransomware family was most significantly affected by year. In addition, the Boosting model has the highest accuracy, with a test error of only about 3%.
{"title":"The Application of Machine Learning in Bitcoin Ransomware Family Prediction","authors":"Shengyun Xu","doi":"10.1145/3471287.3471300","DOIUrl":"https://doi.org/10.1145/3471287.3471300","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
Acquiring and aggregating data from a group of individuals is crucial for studying their general behavior. Differentially Private (DP) techniques, characterized by the parameter ϵ, help to protect Individually Identifiable Data (IID) of individuals participating in such data collection. However, such techniques reduce the usefulness of the data, leading to a trade-off between usefulness and privacy and making the selection of ϵ an important problem to solve before data acquisition. In this work, we use a mathematical formalism to estimate usefulness and privacy for a sum query as the aggregate analysis under the local model of privacy. The mathematical relation enables the application of a variety of optimization techniques, discussed in the work, to select an optimal value of ϵ. Existing methods for selecting ϵ are based on financial parameters, but they rely heavily on past data and domain knowledge, which may not be available in many cases. To address this, we provide knee-point based recommendations along with a selection criterion for choosing the method of recommendation depending on the availability of information. This allows analysts to make informed decisions while negotiating the value of ϵ. Our experiments on synthetic and real-world datasets demonstrate the strength of the mathematical model and the recommended values.
{"title":"Selection and Verification of Privacy Parameters for Local Differentially Private Data Aggregation","authors":"Snehkumar Shahani, Abraham Jibi, R. Venkateswaran","doi":"10.1145/3471287.3471306","DOIUrl":"https://doi.org/10.1145/3471287.3471306","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
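The trade-off between usefulness and privacy for a sum query in the local model can be illustrated with the standard Laplace mechanism. The abstract does not say which mechanism or formalism the authors use, so the choice of Laplace noise, the clipping bounds and every name below are assumptions for this sketch.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample by inverting the CDF."""
    u = rng.random() - 0.5                    # uniform on [-0.5, 0.5)
    tail = max(1.0 - 2.0 * abs(u), 1e-300)    # guard against log(0)
    return -scale * math.copysign(1.0, u) * math.log(tail)


def local_dp_sum(values, epsilon, lower, upper, rng):
    """Each user clips and perturbs their own value; the collector only sums.

    With values clipped to [lower, upper], Laplace noise of scale
    (upper - lower) / epsilon per user gives epsilon-DP in the local model.
    """
    scale = (upper - lower) / epsilon
    return sum(
        min(max(v, lower), upper) + laplace_noise(scale, rng) for v in values
    )


def noise_std_of_sum(n_users, epsilon, lower, upper):
    """Standard deviation of the total noise: sqrt(2 * n) * scale."""
    scale = (upper - lower) / epsilon
    return math.sqrt(2.0 * n_users) * scale
```

Since the noise standard deviation scales as 1/ϵ, halving ϵ (more privacy) doubles the expected error of the sum, which is the kind of relation an optimisation or knee-point search over ϵ can work with.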
Seema Shedage, Jake Farmer, Doga Demirel, Tansel Halic, S. Kockara, V. Arikatla, K. Sexton, Shahryar Ahmadi
Minimally invasive skills assessment is important in developing competent surgical simulators and executing reliable skills evaluation [9]. Arthroscopic and laparoscopic surgeries are considered Minimally Invasive Surgeries (MIS). In MIS, the surgeon operates through small incisions with specialized narrow instruments, fiberoptic lights, and a monitor. Arthroscopic surgery is used to diagnose and treat joint problems, and laparoscopic procedures are performed on the abdominal cavity. Due to non-natural hand-eye coordination, a narrow field of view, and limited instrument control, MIS is challenging to master. We analyze data from two simulators, the Virtual Arthroscopic Tear Diagnosis and Evaluation Platform (VATDEP) and the Gentleness Simulator. Both simulators went through validation studies with human subjects. We recorded simulation data during the validation studies, such as tool motion, position, and task time. The recorded data went through preprocessing; after data cleaning, we extracted features from the recorded data and normalized them. The normalized features were used as input to various machine learning algorithms, including K-nearest neighbor (KNN), support vector machine (SVM), and logistic regression (LR). The average accuracy was evaluated through k-fold cross-validation. The proposed methods were validated using 10 subjects (5 experts, 5 novices) for the VATDEP simulator and 23 subjects (4 experts and 19 novices) for the Gentleness Simulator. The results show a significant difference between the expert and novice populations (p < 0.05, Mann-Whitney U-test). The average accuracy of the classification algorithms is 74% for the VATDEP simulator and 80% for the Gentleness Simulator. The results show that the normalized features with KNN, SVM, and LR classifiers can provide accurate classification of experts and novices. The evaluation technique proposed in this study can improve surgical training by providing appropriate feedback to trainees to evaluate proficiency.
{"title":"Development of Virtual Skill Trainers and Their Validation Study Analysis Using Machine Learning","authors":"Seema Shedage, Jake Farmer, Doga Demirel, Tansel Halic, S. Kockara, V. Arikatla, K. Sexton, Shahryar Ahmadi","doi":"10.1145/3471287.3471296","DOIUrl":"https://doi.org/10.1145/3471287.3471296","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
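The Mann-Whitney U-test used above to compare experts and novices rests on a simple pairwise count. The sketch below computes only the U statistics and a normal-approximation z-score without tie correction; it is an illustration under those assumptions, and for groups as small as those in the study an exact table or a statistics library would be the safer choice for the p-value.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): pairwise 'a beats b' counts, ties counting one half."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


def u_z_score(u, n_a, n_b):
    """Normal approximation of U (no tie correction; rough for small samples)."""
    mean = n_a * n_b / 2.0
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    return (u - mean) / sd


# Hypothetical task times in seconds (not data from the study).
expert_times = [41.0, 38.5, 44.2, 40.1]
novice_times = [63.0, 58.4, 71.9, 66.3, 60.2]
u_expert, u_novice = mann_whitney_u(expert_times, novice_times)
```

A U value of 0 for one group means every observation in that group is smaller than every observation in the other, the most extreme separation the test can report.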
Financial market prediction has been a popular research topic in recent years. However, the majority of previous studies focus on markets in large countries such as China and the United States, while small countries have drawn less attention. To fill this gap in the current literature, we use and compare 17 types of machine learning models to forecast the Nepal market in this paper. Based on stock prices, 10 technical indicators were computed as input features. In addition, we added sentiment factors extracted from financial news to improve the prediction performance, which was evaluated by accuracy and F1 score. We predicted whether the closing price would rise or fall over three horizons: 1-day, 15-day and 30-day movement. From our experimental results, we found that linear SVM and XGBoost perform best and are the best options for further consideration in the trading process.
{"title":"Nepal Stock Market Movement Prediction with Machine Learning","authors":"Shu-Fei Zhao","doi":"10.1145/3471287.3471289","DOIUrl":"https://doi.org/10.1145/3471287.3471289","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
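The setup described in the abstract above (technical indicators as features, a rise-or-fall label at several horizons) can be sketched as follows. The abstract does not list the 10 indicators, so the simple moving average here is only an assumed example, and the function names and prices are hypothetical.

```python
def simple_moving_average(closes, window):
    """Simple moving average of closing prices over a fixed window."""
    return [
        sum(closes[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(closes))
    ]


def movement_labels(closes, horizon):
    """Label day t with 1 if the close `horizon` days later is higher, else 0."""
    return [
        1 if closes[t + horizon] > closes[t] else 0
        for t in range(len(closes) - horizon)
    ]


# Hypothetical closing prices; the study uses horizons of 1, 15 and 30 days.
closes = [10.0, 11.0, 12.0, 11.0, 13.0]
sma_3 = simple_moving_average(closes, window=3)
labels_1d = movement_labels(closes, horizon=1)
labels_2d = movement_labels(closes, horizon=2)
```

The last `horizon` days have no label because their future close is unknown, so features and labels need to be aligned on the shorter range before training any classifier.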
Recommender systems are taking the lead among the many things that the digital world offers today to every customer visiting online portals for any service. Since gaining popularity at the time of the Netflix competition, recommender systems have become more visible and an important marketing and sales tool for corporations augmenting their offers online. Ongoing research initiatives in recommender systems, large datasets available for users across the globe, and corporate collaborations have led to improved algorithms and reduced errors in estimating predictions. Software and hardware tools that enable easy gathering of implicit and explicit data have helped recommender systems to adapt quickly to the needs of users. Against this background, the possibility of a recommender system steering the customer to pre-determined items by presenting fabricated predictions, as if they resulted from scientific principles, needs to be considered. In this paper, we give an overview of recommender systems, discuss how various components of a recommender system may be manipulated to lure unsuspecting customers with false ratings, and discuss the importance of engaging stakeholders to develop a robust recommender system.
{"title":"Recommender System: Personalizing User Experience or Scientifically Deceiving Users?","authors":"Ramachandran Trichur Narayanan","doi":"10.1145/3471287.3471303","DOIUrl":"https://doi.org/10.1145/3471287.3471303","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
Email templates have a significant impact on user productivity. A well-produced email template conveys the main information and leaves a strong impression. While previous studies focused on email generation from text differences in the content of emails, templates generated from email topics can provide better productivity for companies. This article proposes a system in which user emails are clustered according to their topics, and introduces an email template generation system that utilizes sample emails belonging to the resulting clusters. For this purpose, the Enron email dataset has been used, and the performance of different text preprocessing and topic modeling algorithms, such as DMM, GPU-DMM, GPU-PDMM, LF-DMM, LDA, LF-LDA, BTM, WNTM, PTM, and SATM, has been investigated and compared to determine the most efficient one. After obtaining the email topics, the system shows examples of the emails representing the selected topics and enables authorized users to create templates that generalize these topics.
{"title":"Email Clustering & Generating Email Templates Based on Their Topics","authors":"Fatih Coşkun, C. Gezer, V. C. Gungor","doi":"10.1145/3471287.3471298","DOIUrl":"https://doi.org/10.1145/3471287.3471298","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
With the growing attention towards Arabic Sentiment Analysis (SA), the availability of annotated datasets has risen. Although acquiring data from social media platforms, microblogs and similar sources is an easy task, annotation is the hard part. Dataset annotation requires a lot of tedious manual work, which stands as a major problem. In addition, some datasets are built in house and are not available for public access. This paper introduces LASTD, a manually annotated dataset for Arabic tweet sentiment analysis, along with an insight into its statistics and benchmarks. It consists of more than 15K Arabic tweets annotated as positive, negative or neutral. Using 10-fold cross-validation, three different classifiers were trained and tested on a 3-class and a 2-class classification problem. The support vector machine (SVM) classifier tends to have the highest accuracy. LASTD is made public for academic research.
{"title":"LASTD: A Manually Annotated and Tested Large Arabic Sentiment Tweets Dataset","authors":"Kariman Elshakankery, M. Fayek, Mona Farouk","doi":"10.1145/3471287.3471293","DOIUrl":"https://doi.org/10.1145/3471287.3471293","journal":{"name":"2021 the 5th International Conference on Information System and Data Mining"},"publicationDate":"2021-05-27"}
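The 10-fold cross-validation mentioned in the abstract above partitions the data into ten folds and uses each fold once for testing. A minimal sketch of the index bookkeeping follows; the function name, the shuffling and the fixed seed are assumptions, and the classifiers themselves are left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example: 25 samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
```

Averaging a classifier's accuracy over the ten test folds gives the kind of cross-validated figure the abstract reports; a stratified variant would also keep the class proportions similar in every fold.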