For many tasks in market research it is important to model customers and products as comparable instances. Usually, the integration of customers and products into one model is not trivial. In this paper, we detail an approach for a combined vector space of customers and products based on word embeddings learned from receipt data. To highlight the strengths of this approach we propose four different applications: recommender systems, customer segmentation, product segmentation, and purchase prediction. Experimental results on a real-world dataset with 200M order receipts for 2M customers show that our word embedding approach is promising and helps to improve the quality in these application scenarios.
{"title":"Modeling Customers and Products with Word Embeddings from Receipt Data","authors":"Lucas Woltmann, Maik Thiele, Wolfgang Lehner","doi":"10.1145/3216122.3229860","DOIUrl":"https://doi.org/10.1145/3216122.3229860","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
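The idea of placing customers and products in one vector space can be illustrated with a small sketch. It is not the authors' method: it swaps the learned word embeddings for plain co-occurrence counts, and it assumes that each receipt is treated as one "sentence" whose tokens are the customer ID and the purchased products. The toy receipts and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy receipts: (customer token, purchased product tokens).
receipts = [
    ("cust_A", ["milk", "bread", "butter"]),
    ("cust_A", ["milk", "bread"]),
    ("cust_B", ["beer", "chips"]),
    ("cust_B", ["beer", "chips", "salsa"]),
]


def cooccurrence_vectors(receipts):
    """Map every token (customer or product) to a sparse co-occurrence vector."""
    vectors = defaultdict(Counter)
    for customer, products in receipts:
        tokens = [customer] + products  # assumption: one receipt = one "sentence"
        for t in tokens:
            for u in tokens:
                if t != u:
                    vectors[t][u] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_products(customer, products, vectors):
    """Order candidate products by similarity to the customer's vector."""
    return sorted(products, key=lambda p: cosine(vectors[customer], vectors[p]), reverse=True)


vectors = cooccurrence_vectors(receipts)
print(rank_products("cust_A", ["beer", "milk"], vectors))
```

Because customers and products live in the same space, the same `cosine` function compares customer to product, customer to customer, or product to product, which is what makes recommendation and segmentation share one representation.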
Extended argumentation frameworks (EAFs) extend Dung's argumentation frameworks (AFs) to represent a kind of defeasible attack (by relying on the concept of second-order attack), in addition to Dung's classical notion of attack between arguments. EAFs can be profitably used to model disputes between agents, with the aim of deciding the sets of arguments (called extensions) that should be accepted to support a point of view in a discussion. However, since new arguments and attacks are often introduced to take into account newly available knowledge, EAFs as well as their extensions change over time. In this paper we tackle the problem of efficiently recomputing extensions of dynamic EAFs under two well-known semantics (i.e., preferred and stable semantics). We introduce an incremental approach that, given an initial EAF, an initial extension for it, and an update, computes an extension of the updated EAF. This is achieved by introducing a meta-argumentation transformation according to which an initial EAF, as well as a given initial extension and an update, is transformed into a plain argumentation framework with a corresponding extension and update. The proposed approach is able to incorporate existing AF-solvers to compute an extension of the updated EAF. The experimental analysis showed that our technique is significantly faster than computing extensions of updated EAFs from scratch.
{"title":"Computing Extensions of Dynamic Abstract Argumentation Frameworks with Second-Order Attacks","authors":"Gianvincenzo Alfano, S. Greco, F. Parisi","doi":"10.1145/3216122.3216162","DOIUrl":"https://doi.org/10.1145/3216122.3216162","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
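For readers unfamiliar with the stable semantics mentioned in the abstract, the following sketch enumerates the stable extensions of a plain Dung AF by brute force. It covers neither second-order attacks nor the incremental technique of the paper; it corresponds to the "from scratch" baseline on a plain AF. The example framework and function names are hypothetical.

```python
from itertools import combinations


def is_conflict_free(subset, attacks):
    """No argument in the subset attacks another argument in the subset."""
    return not any((a, b) in attacks for a in subset for b in subset)


def is_stable(subset, arguments, attacks):
    """Stable: conflict-free and attacking every argument outside the subset."""
    outside = arguments - subset
    return is_conflict_free(subset, attacks) and all(
        any((a, b) in attacks for a in subset) for b in outside
    )


def stable_extensions(arguments, attacks):
    """Enumerate all stable extensions by checking every subset (exponential)."""
    ordered = sorted(arguments)
    found = []
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            candidate = set(combo)
            if is_stable(candidate, arguments, attacks):
                found.append(candidate)
    return found


# Hypothetical AF: a and b attack each other, and b attacks c.
arguments = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "a"), ("b", "c")}
print(stable_extensions(arguments, attacks))  # [{'b'}, {'a', 'c'}] (set element order may vary)
```

The exhaustive search over all subsets is what makes recomputation after every update expensive, which motivates incremental approaches such as the one the abstract describes.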
G. Sideris, Dimitrios Katsaros, Antonis Sidiropoulos, Y. Manolopoulos
The deluge of data on scholarly output has created unique opportunities for identifying the drivers of modern science, for studying the career paths of scientists, and for measuring research performance. These massive data and processing methodologies have given rise to an exciting new field, namely Science of Science (SoS), as the successor of what has been called scientometrics or informetrics for many decades. Science of Science is the offspring of the fertile cooperation of many disciplines, such as network science, statistics, machine learning, mathematical analysis, sociology of science and so on. In this article, we provide a comprehensive coverage of recent advances in SoS related to network analysis, prediction and ranking, and investigate the issue of scientist ranking from a multilayer network perspective. Towards this goal, we contrast by experiments the well-known h-index and the recently proposed indicator C3-index to a generalization of PageRank for multilayer networks, namely BiPlex PageRank, which is based on solid tensor analysis. Both the obtained results and the brief survey of SoS will deepen our faith in SoS and stimulate further efforts in this transdisciplinary field.
{"title":"The Science of Science and a Multilayer Network Approach to Scientists' Ranking","authors":"G. Sideris, Dimitrios Katsaros, Antonis Sidiropoulos, Y. Manolopoulos","doi":"10.1145/3216122.3229862","DOIUrl":"https://doi.org/10.1145/3216122.3229862","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
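As a reminder of the baseline indicator named in the abstract, the h-index is the largest h such that a scientist has h papers with at least h citations each. A minimal sketch follows; the function name and the citation counts are hypothetical, and the C3-index and BiPlex PageRank are not covered here.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one scientist's papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The h-index looks only at one scientist's citation counts in isolation, whereas a PageRank-style indicator also uses the structure of the surrounding network.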
The structure of the ever-increasing large RDF repositories is too complex to allow non-expert users to extract useful information from them. Keyword search is an interesting alternative, but in the context of RDF graph data, where query answers are RDF graph fragments, it faces two major problems: the query answer quality problem and the result computation algorithm scalability problem. In this paper we focus on empowering keyword search on RDF data by exploiting personalized information. We propose an original approach which exploits the structural summary of the RDF graph to generate pattern graphs for the input keyword query. Pattern graphs are structured conjunctive queries and are seen as possible interpretations of the unstructured keyword query. Personalized information is represented as collections of profile graphs, a concept similar to pattern graphs. The ranking of the results is achieved by measuring graph similarity between the user profile graph and the generated pattern graphs. Novel similarity metrics have been introduced which consider intrinsic and extrinsic similarity and take into account both structural and semantic characteristics of the pattern and profile graphs. Effectiveness and efficiency experimental results show that our approach can tackle the two major problems that hinder the widespread use of keyword search on RDF data.
{"title":"Personalized Keyword Search on Large RDF Graphs based on Pattern Graph Similarity","authors":"S. Sinha, Xinge Lu, D. Theodoratos","doi":"10.1145/3216122.3216167","DOIUrl":"https://doi.org/10.1145/3216122.3216167","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
In the last decades, Social Networks (SNs) have deeply changed the interactions and habits of users, who are also prone to create more than one profile on the same SN. On the flip side, fake profiles (i.e., impersonating profiles) have become a considerable problem in digital investigations. In this paper, we propose a method for user profile resolution through a cluster-based approach applied to the smartphone fingerprints extracted from the images posted on SNs. The proposed method is thus able to detect fake profiles. To evaluate our approach, we use a real dataset of 1,500 images from 10 different smartphone devices and the Facebook and WhatsApp platforms. The results show that the average of sensitivity and specificity for user profile resolution is about 98%.
{"title":"A Cluster-based Approach of Smartphone Camera Fingerprint for User Profiles Resolution within Social Network","authors":"R. Rouhi, Flavio Bertini, D. Montesi","doi":"10.1145/3216122.3216123","DOIUrl":"https://doi.org/10.1145/3216122.3216123","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
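The reported figure combines two standard metrics: sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP). The sketch below shows how their average could be computed from confusion counts. The counts are hypothetical and chosen only for illustration; they are not the paper's data.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives that were rejected."""
    return tn / (tn + fp)


# Hypothetical confusion counts for a profile-matching decision.
tp, fn, tn, fp = 49, 1, 97, 3

average = (sensitivity(tp, fn) + specificity(tn, fp)) / 2
print(f"sensitivity={sensitivity(tp, fn):.3f} "
      f"specificity={specificity(tn, fp):.3f} average={average:.3f}")
```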
The web was ushered in with great expectations, formally in May 1994, in a conference called World Wide Web I. This event, in hindsight, is sometimes referred to as the Woodstock of the web. The web and Mosaic, the graphical browser announced soon after, have revolutionized the internet. For most people, the internet is the web, while one of the monopolistic tech corporations wants the world to view its platforms as not only the web but the Internet! The web has given rise to a number of rich, powerful corporations which did not exist before its advent. The easy-to-use graphical interface and the cell phone with its tiny screen have become the de facto interface to all kinds of applications and have provided new methods of communication and connection. The control of all this by a small number of monopolistic corporations, who have amassed vast quantities of data on people, has created a situation which has become a web of betrayal of the promise of freely sharing and providing information. We also consider the remote possibility of a new, freer web without monopolies.
{"title":"The Web of Betrayals","authors":"B. Desai","doi":"10.1145/3216122.3216140","DOIUrl":"https://doi.org/10.1145/3216122.3216140","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
Thi-Thanh-Quynh Nguyen, Christophe Bobineau, V. Debusschere, Quang-Huy Giap, N. Hadjsaid
In the control and management of smart grids, from steady state to real-time, the objective is to handle and to treat any change in the system as fast as possible, with as few resources as possible. In this context, this paper proposes a new language, called Smartlog, designed as a declarative programming language. Smartlog is developed for distributed computing in real-time and distributed database management. Compared to imperative programming, based on anticipation rather than reaction, the benefit is neither to ignore the meaning of some data nor to collect and analyze data of no interest, which would waste bandwidth and computation time. Smartlog is designed for operating smart grids, which are defined as abstract structures of large and scalable distributed databases. After its definition, the main features of the Smartlog language (its compactness, its simplicity and its scalability) are shown. The language is tested on the application of a frequency and voltage secondary control of an islanded micro-grid in an experimental test case, using a real-time simulator connected to Raspberry Pis. The characteristics of Smartlog are illustrated through a comparison with an imperative programming implementation of the same regulation.
{"title":"Using declarative programming for network data management in smart grids","authors":"Thi-Thanh-Quynh Nguyen, Christophe Bobineau, V. Debusschere, Quang-Huy Giap, N. Hadjsaid","doi":"10.1145/3216122.3216160","DOIUrl":"https://doi.org/10.1145/3216122.3216160","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
Pietro Cinaglia, G. Tradigo, G. Cascini, E. Zumpano, P. Veltri
Extracting morphological features from DICOM images is useful to obtain numerical anatomic values for population-wide studies. Currently, software tools on medical devices are able to extract some parameters that can indicate the presence of diseases. Nevertheless, there is still a lot of unexploited information contained in images which can be useful for research as well as to characterize human behavior. For instance, measures for lung volume compared with reference data sets can be studied starting from clinical images. In this paper we report preliminary results on a framework for the acquisition and decomposition of DICOM images, applied to a dataset containing lung exams from which we extracted information and parameters useful for disease research studies. The proposed algorithms for image segmentation and anatomical feature extraction have been tested on a clinical dataset obtained from the University Hospital of Catanzaro, supporting the validity of the framework.
{"title":"A framework for the decomposition and features extraction from lung DICOM images","authors":"Pietro Cinaglia, G. Tradigo, G. Cascini, E. Zumpano, P. Veltri","doi":"10.1145/3216122.3216127","DOIUrl":"https://doi.org/10.1145/3216122.3216127","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
Cities are the main poles of human and economic activity. Analyzing city data is very important to improve the city economy as well as the quality of life of the citizens. Since location-based services and GPS devices can easily connect users located in different positions, it is worthwhile to optimize the efficiency of their shifting to a common location according to their preferences. For this reason, the support of advanced analysis queries such as the skyline operator has become important. The latter finds the interesting objects according to a user's preferences. However, data in such applications can be uncertain, imprecise and incomplete. In this paper, we propose an imperfect spatial skyline query for users located in different positions. Detailed experimental analyses are reported. In addition, the theoretical properties developed in this paper help to devise efficient techniques to compute the spatial skyline over uncertain data for a set of users. Our extensive experiments show that the proposed algorithms provide quick initial response time.
{"title":"A Step forward for Spatial Skyline Queries for a Group of Users: Semantic in the Evidence Theory Setting","authors":"Sayda Elmi, Jun-Ki Min","doi":"10.1145/3216122.3216142","DOIUrl":"https://doi.org/10.1145/3216122.3216142","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}
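The skyline operator mentioned in the abstract returns the objects that no other object dominates, where one object dominates another if it is at least as good in every dimension and strictly better in at least one. The sketch below shows the classical, certain-data version with a simple quadratic scan. It does not model the spatial, multi-user, or evidence-theory aspects of the paper, and the data and names are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in all dimensions and better in one (lower is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points not dominated by any other point (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical meeting places as (distance in km, price), both to be minimized.
places = [(1, 9), (2, 7), (3, 8), (4, 3), (5, 5)]
print(skyline(places))  # [(1, 9), (2, 7), (4, 3)]
```

Here (3, 8) is dropped because (2, 7) is both closer and cheaper, and (5, 5) is dropped because of (4, 3). The remaining places represent genuine trade-offs between the two criteria.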
The need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned in order to effectively use the distributed computational resources. In this paper, we explore these problems, and then propose a solution which has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. By using four stages of successive refinements, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+.
{"title":"Efficient Big Data Clustering","authors":"M. Ianni, E. Masciari, G. Mazzeo, C. Zaniolo","doi":"10.1145/3216122.3216154","DOIUrl":"https://doi.org/10.1145/3216122.3216154","journal":{"name":"Proceedings of the 22nd International Database Engineering & Applications Symposium"},"publicationDate":"2018-06-18"}