
Proceedings of the 2020 the 4th International Conference on Compute and Data Analysis: latest papers

Sentiment and Emotion Analyses for Malaysian Mobile Digital Payment Applications
Vimala Balakrishnan, Pravin Kumar Selvanayagam, L. Yin
Sentiment and emotion analyses provide a quick and easy way to infer users' perceptions of products, services, topics and events, which makes them useful to businesses and government bodies for effective decision making. In this paper, we describe the outcomes of sentiment and emotion analyses performed on a mobile payment app, Boost, which is available in the Google Play Store. A total of 2463 text reviews were gathered; after pre-processing, 1054 of these reviews were annotated and used for sentiment and emotion analyses. Four supervised learning algorithms, namely Support Vector Machine, Naïve Bayes, Decision Tree and Random Forest, were compared using Python. Accuracy and F1 scores indicate that Random Forest outperformed all the other algorithms for both sentiment and emotion analyses. Anger dominated the reviews with negative sentiment, whereas joy was observed in the positive reviews.
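The two evaluation measures named in the abstract can be sketched in a few lines of plain Python. This is only an illustration: the abstract does not say which F1 averaging was used, so the macro average below is an assumption, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Comparing classifiers then amounts to calling both functions on each model's predictions for the same held-out reviews and ranking the models by the resulting scores.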
DOI: https://doi.org/10.1145/3388142.3388144. Published: 2020-03-09.
Citations: 4
Africa's Malaria Epidemic Predictor: Application of Machine Learning on Malaria Incidence and Climate Data
M. Masinde
The 2019 World Malaria Report confirms that Africa continues to bear the burden of malaria morbidity. The continent accounted for over 93% of the global malaria incidence reported in 2018. Despite numerous multi-level and consultative efforts to combat this epidemic, malaria continues to claim thousands of human lives, especially those of children under 5 years of age. Since malaria is preventable and treatable, one way to reduce the number of deaths is to implement an effective early warning system that can forecast malaria outbreaks long before they occur. This way, policymakers can put mitigation measures in place. Tapping into the success of machine learning algorithms in predicting disease outbreaks, we present a malaria outbreak prediction system that is anchored on the well-established correlation between certain climatic conditions and the breeding environment of the malaria-carrying vector (mosquito). Historical datasets on climate and malaria incidence are used to train nine machine learning algorithms, and the four best-performing ones are identified based on classification accuracy and computation performance. Before the models were developed, reliability and correlation analyses were carried out on the data, followed by a reduction of the dimensionality of the feature space of the two datasets. Given the power of deep learning in handling selectivity variance, the malaria predictor system was developed based on the deep learning algorithm. Further, the system was evaluated using the Simulator function in RapidMiner, and the accuracy of the predictions was assessed using an independent dataset that was not used in the models' development. With prediction accuracy of up to 99%, this system has the potential to contribute to the fight against the malaria epidemic in Africa and elsewhere in the world.
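The correlation analysis and feature reduction steps mentioned above might look roughly like the following. The abstract does not name the correlation coefficient or the reduction technique (and the work itself was done in RapidMiner), so the Pearson coefficient, the 0.5 cut-off and the helper names are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(columns, target, threshold=0.5):
    """Keep the features whose |r| with the target reaches the threshold.

    `columns` maps a feature name to its series; the threshold is a
    hypothetical value, not one reported in the paper.
    """
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]
```

A climate variable that moves with incidence would be kept, while one that is uncorrelated with it would be dropped before model training.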
DOI: https://doi.org/10.1145/3388142.3388158. Published: 2020-03-09.
Citations: 5
An Approach to Temporal Phase Classification on Videos of the Volleyball's Basic Reception Technique
Jose G. Garcia, Elizabeth R. Villota, C. B. Castañón
In this paper we provide an approach to sports analysis using deep learning techniques. As part of a current project, the volleyball's basic reception technique has been divided into temporal phases. We performed an evaluation on our own labelled dataset consisting of 14814 frames from 69 videos depicting the desired reception technique. A model based on the YOLO algorithm was trained to locate the player region and trim the frames. Two time fusion methods over the frames were proposed and evaluated with CNN models that were built on the ResNet models and trained with a transfer learning approach. The results show that these models were able to classify the frames into their corresponding phases, with an accuracy of 92.21% for our best model. The RGB merging method shown in this paper also helps to slightly improve the performance of the models. Furthermore, the models were capable of learning the temporality of the phases, as the mistakes made by the models occurred between consecutive phases.
DOI: https://doi.org/10.1145/3388142.3388150. Published: 2020-03-09.
Citations: 0
A Study and Implementation of Virtual Reality and its Capabilities
Jacob Stec, S. Shanmugam
Virtual Reality is a technology that helps the user interact with a simulated environment, whether of the real world or an imaginary one. This paper details an application that uses the Unity game engine to showcase and demonstrate the capabilities of VR. These capabilities include: movement of the player, demonstrating the ability to move the player model around the play area using the controllers; in-game head tracking, demonstrating VR head tracking technology by moving the in-game camera to match the user's head position in real life; and in-game hands that mimic the positioning of the user's real hands in relation to the body, used to interact with objects within the virtual space.
DOI: https://doi.org/10.1145/3388142.3388173. Published: 2020-03-09.
Citations: 2
I-Privacy Photo: Face Recognition and Filtering
Amal Almansour, Ghada Alsaeedi, Haifaa Almazroui, Huda Almuflehi
The ever-increasing popularity of Online Social Network (OSN) sites for posting and sharing photos and videos has led to unprecedented concerns about privacy violations. Existing OSN sites offer only limited privacy protection. Most solutions focus on conditional access control, that is, allowing users to control who can access the shared photos and videos. This research study attempts to address this issue and studies the scenario in which a user shares a photo or video containing individuals other than himself/herself (public-level photos and videos). To preserve privacy, the proposed system is intended to support automated human face recognition and filtering for public-level photos and videos. Our proposed approach takes into account the content of a photo and makes use of face filtering as a strategy to increase privacy while still allowing users to share photos. First, the proposed system automatically identifies a person's face frame in a digital image or video. Next, it compares the detected face features to each face vector stored in the application database. After the face recognition step is completed, the proposed system filters all unknown persons in the image. A Convolutional Neural Network (CNN) was used for the face detection step, while deep learning facial embedding algorithms were used for the recognition. Both have shown high accuracy in addition to the capability of being executed in real time. For face filtering, a Gaussian algorithm was used for face blurring, as it is considered a very fast real-time algorithm that allows the user to control the degree of blurring. Based on the results obtained after testing the system on three different datasets, we can conclude that our system can detect and recognize the faces in photos and videos, using the improved Convolutional Neural Network (CNN) for face detection with 91.3% accuracy and K-Nearest Neighbor (KNN) for face recognition with 96.154% accuracy on the I-Privacy dataset.
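The recognition and filtering steps described above (compare each detected face to the stored vectors, then filter whoever is unknown) can be sketched as a nearest-neighbour lookup. This is a simplified illustration and not the authors' implementation: it uses a single nearest neighbour rather than a general KNN vote, and the `max_distance` cut-off of 0.6, the function names and the tuple-based database layout are all assumptions.

```python
import math


def identify(embedding, known_faces, max_distance=0.6):
    """Return the name of the closest stored face vector, or None when no
    stored vector lies within max_distance (the face counts as unknown)."""
    best_name, best_dist = None, float("inf")
    for name, vector in known_faces:
        dist = math.dist(embedding, vector)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


def faces_to_blur(detections, known_faces, max_distance=0.6):
    """Indices of the detected faces that match nobody in the database."""
    return [i for i, emb in enumerate(detections)
            if identify(emb, known_faces, max_distance) is None]
```

The indices returned by `faces_to_blur` would then be passed to whatever blurring routine the application uses; real embeddings have many more dimensions than a toy example would show.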
DOI: https://doi.org/10.1145/3388142.3388161. Published: 2020-03-09.
Citations: 1
Learning-based models to detect runtime phishing activities using URLs
Surya Srikar Sirigineedi, Jayesh Soni, Himanshu Upadhyay
Phishing websites are fraudulent sites that impersonate a trusted party to gain access to the sensitive information of an individual or organization. Traditionally, phishing website detection is done using blacklist databases. However, with the rapid development of global networking and communication technologies, there are now so many websites, with new ones created every second, that classification based on traditional methods has become difficult. In this paper, we propose a real-time anti-phishing system. In the first step, we extract the lexical and host-based properties of a website. In the second step, we combine URL (Uniform Resource Locator) features, NLP and host-based properties to train the machine learning and deep learning models. Our detection model is able to detect phishing URLs with a detection rate of 94.89%.
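As an illustration of the first step, lexical properties can be read directly from the URL string with Python's standard `urllib.parse`. The abstract does not list the features the authors extracted, so the particular set below and the function name are assumptions chosen only to show the idea; host-based properties such as domain age would need external lookups and are left out.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Compute a few string-level features of a URL (illustrative set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        # A bare IPv4 address in place of a domain name.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }
```

Each URL becomes one feature dictionary, which can be turned into a numeric vector and combined with the other feature groups before training a classifier.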
DOI: https://doi.org/10.1145/3388142.3388170. Published: 2020-03-09.
Citations: 9
Fast Automatic Determination of Cluster Numbers for High Dimensional Big Data
Z. Safari, Khalid T. Mursi, Yu Zhuang
For a large volume of data, clustering algorithms are of significant importance for categorizing and analyzing the data. Accordingly, choosing the optimal number of clusters (K) is an essential factor, but it is also a tricky problem in big data analysis. More important still is determining the best K efficiently and automatically, which is the main issue in clustering algorithms. Balancing the quality and the efficiency of the clustering algorithm while determining K is a trade-off, and overcoming it is our primary aim. K-Means remains one of the most popular clustering algorithms, but it has the shortcoming that K must be set in advance. We introduce a new process that runs K-Means fewer times by selecting the most promising moments to run it. To achieve this goal, we applied Bisecting K-Means and a different splitting measure, which together help determine the number of clusters efficiently and automatically while maintaining the quality of clustering for a large set of high dimensional data. We carried out experimental studies on different data sets and found that our procedure offers the flexibility of choosing among different criteria for determining the optimal K. Experiments indicate higher efficiency, through lower computation cost, compared with the Ray & Turi method or with the use of only the K-Means algorithm.
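The general idea of letting Bisecting K-Means decide K can be sketched as follows: keep splitting the cluster with the largest sum of squared errors (SSE) in two, and stop once a split no longer reduces the SSE by a meaningful share. The stopping rule here (`min_gain`, a fraction of the SSE of the whole data set) is a simple stand-in and not the splitting measure proposed in the paper; the deterministic seeding and all names are likewise assumptions.

```python
import math


def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def two_means(points, iterations=20):
    """Split points into two groups with a deterministically seeded 2-means."""
    c = centroid(points)
    a = max(points, key=lambda p: math.dist(p, c))  # far from the centre
    b = max(points, key=lambda p: math.dist(p, a))  # far from the first seed
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = centroid(left), centroid(right)
        if new_a == a and new_b == b:  # assignments have stabilised
            break
        a, b = new_a, new_b
    return left, right


def bisecting_kmeans(points, min_gain=0.05, max_k=10):
    """Bisect the worst cluster until a split gains less than min_gain."""
    total = sse(points)
    clusters = [list(points)]
    while total > 0 and len(clusters) < max_k:
        target = max(clusters, key=sse)
        left, right = two_means(target)
        if not left or not right:
            break
        gain = (sse(target) - sse(left) - sse(right)) / total
        if gain < min_gain:  # assumed stopping rule, see lead-in
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters
```

The number of clusters found is simply `len(bisecting_kmeans(points))`; each step runs only a two-way split on one cluster instead of a full K-Means over all data for every candidate K, which is where the cost saving of this family of methods comes from.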
DOI: https://doi.org/10.1145/3388142.3388164. Published: 2020-03-09.
Citations: 6
Use of FP-Growth Algorithm in Identifying Influential Users on Twitter Hashtags
Islam Elkabani, Layal Abu Daher, R. Zantout
Due to the spread of technology and the World Wide Web, online social media has entered every home in the world; hence, the analysis of such networks has become an important, yet challenging, subject of study for researchers. One of the most interesting fields of study in social network analysis is the identification of influential users, who are important actors in online social networks. In this paper, we identify influential users on several trending hashtags. The data for these hashtags was collected between December 2015 and March 2016. Association Rule Learning was employed to identify influential users from the collected hashtags. In order to investigate why users were detected as influential, different Influence Measures were identified. The results of this study indicate the effectiveness of Association Rule Learning both for identifying influential users and for detecting the most effective Influence Measures for these users.
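Association Rule Learning rests on two quantities, support and confidence. The sketch below computes them by brute-force counting over item pairs; it is not FP-Growth, which reaches the same frequent itemsets far more efficiently through a prefix tree. How the paper turns hashtag activity into transactions is not stated in the abstract, so treating each transaction as the set of users who engaged with one tweet is an assumption, as are the thresholds and names.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (x, y, support, confidence) meaning 'x implies y'.

    support    = share of transactions containing both x and y
    confidence = support(x and y) / support(x)
    """
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))
    single = {i: sum(i in s for s in sets) / n for i in items}
    rules = []
    for a, b in combinations(items, 2):
        both = sum(a in s and b in s for s in sets) / n
        if both < min_support:
            continue  # the pair is not frequent enough
        for x, y in ((a, b), (b, a)):
            confidence = both / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, both, confidence))
    return rules
```

Under the assumed encoding, a user who appears as the consequent of many high-confidence rules is one whose activity tends to accompany that of many other users, which is one possible reading of "influential".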
DOI: https://doi.org/10.1145/3388142.3388148. Published: 2020-03-09.
Citations: 0
Hybrid Scalable Action Rule: Rule Based and Object Based
Jaishree Ranganathan, Sagar Sharma, A. Tzacheva
Action Rule mining is a method for extracting actionable patterns from datasets. Classification rules are those that help predict an object's class, whereas Action Rules are actionable knowledge that suggests how an object's state or class can be changed to a more desirable state to benefit the user. In the internet era, digital data is widespread and growing so rapidly that it is necessary to develop systems that process the data much faster. The literature on Action Rule mining involves two major frameworks: the Rule-Based method, in which the extraction of Action Rules depends on a pre-processing step of classification rule discovery, and the Object-Based method, which extracts Action Rules directly from the database without the use of classification rules. The Object-Based method extracts Action Rules in an Apriori-like manner using frequent action sets. Since this method is iterative, it takes longer to process huge datasets. In this work we propose a novel hybrid approach that generates the complete set of Action Rules by combining the Rule-Based and Object-Based methods. Our results show a significant improvement: the existing algorithm fails to complete on the Twitter dataset, whereas the proposed hybrid approach completes execution and produces Action Rules in less than 500 seconds on a cluster.
DOI: https://doi.org/10.1145/3388142.3388143 (published 2020-03-09)
Citations: 1
Vulnerability Prioritization, Root Cause Analysis, and Mitigation of Secure Data Analytic Framework Implemented with MongoDB on Singularity Linux Containers
Akalanka Mailewa Dissanayaka, S. Mengel, L. Gittner, H. Khan
A Vulnerability Management system is a disciplined, programmatic approach to discovering and mitigating vulnerabilities in a system. While securing systems from data exploitation and theft, Vulnerability Management works as a cyclical practice of identifying, assessing, prioritizing, remediating, and mitigating security weaknesses. In this approach, root cause analysis is conducted to find solutions for the problematic areas in policy, process, and standards, including configuration standards. Three major reasons make Vulnerability Assessment and Management a vital part of IT risk management: 1. Persistent Threats - Attacks exploiting security vulnerabilities for financial gain and criminal agendas continue to dominate headlines; 2. Regulations - Many government and industry regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and Sarbanes-Oxley (SOX), mandate rigorous vulnerability management practices; and 3. Risk Management - Mature organizations treat vulnerability assessment and management as a key risk management component [1]. Thus, as opposed to a reactive and technology-oriented approach, a well-organized and well-executed Vulnerability Management system is proactive and business-oriented. This research first collects all the vulnerabilities associated with the Data Analytic Framework Implemented with MongoDB on Linux Containers (LXCs), using a vulnerability analysis testbed with seven different analysis tools. It then prioritizes all the vulnerabilities as "Low", "Medium", or "High" according to their severity level. Next, it discovers and analyzes the root causes of fifteen vulnerabilities of varying severity. Finally, based on each vulnerability's root cause, this research proposes security techniques to avoid or mitigate those vulnerabilities in the current system.
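The prioritization step described above can be sketched as follows. The abstract does not say how severity is scored, so this example assumes a 0 to 10 severity score bucketed with the three-level CVSS v2 qualitative bands (Low 0.0 to 3.9, Medium 4.0 to 6.9, High 7.0 to 10.0). The function names and the sample findings are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def severity_label(score: float) -> str:
    """Map a 0-10 severity score to Low/Medium/High (assumed CVSS v2 bands)."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"


def prioritize(findings: List[Tuple[str, float]]) -> List[Tuple[str, float, str]]:
    """Label each (name, score) finding and sort the most severe first."""
    labelled = [(name, score, severity_label(score)) for name, score in findings]
    return sorted(labelled, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical findings, for illustration only.
    sample = [
        ("finding-a", 2.1),
        ("finding-b", 9.8),
        ("finding-c", 5.0),
    ]
    for name, score, label in prioritize(sample):
        print(f"{label:<6} {score:>4.1f}  {name}")
```

Sorting by the numeric score rather than by the label keeps the ordering stable within a band, so the most severe "High" findings are addressed first.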
DOI: https://doi.org/10.1145/3388142.3388168 (published 2020-03-09)
Citations: 14