Numerous studies have examined the performance of distributed computing frameworks, most of them focused on processing large volumes of data. This work instead compares two such frameworks on their suitability for implementing CPU-intensive distributed algorithms. As a case study for our experiments we used a simple but computationally intensive puzzle: to find all solutions by brute-force search, 15! permutations had to be generated and tested against the solution rules. The experimental application was implemented in Java using a simple algorithm, with two distributed solutions based on the MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms. The implementations were benchmarked for performance and scalability on Amazon EC2/EMR clusters, where the processing time of both solutions scaled approximately linearly. However, our experiments show that the number of tasks, hardware utilization, and other aspects should also be taken into consideration when assessing scalability. Comparing the MapReduce (Apache Hadoop) and RDD (Apache Spark) solutions on Amazon EMR, the processing time measured in CPU minutes was up to 30% lower with Spark, whose performance particularly benefits from an increasing number of tasks. In terms of how efficiently the EC2 resources were used, the Apache Spark implementation even outperformed a comparable multithreaded Java solution.
"Performance evaluation of Apache Hadoop and Apache Spark for parallelization of compute-intensive tasks." Alexander Döschl, Max-Emanuel Keller, P. Mandl. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429121
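The partitioned brute-force search above can be sketched in miniature. The paper's puzzle and its solution rules are not reproduced here, so this sketch shrinks the search space to 5! and substitutes a stand-in rule (a derangement check); the map/reduce split mirrors how the 15! permutations are distributed across tasks:

```python
from itertools import permutations

def is_solution(p):
    # Stand-in "solution rule" (the paper's actual puzzle rules differ):
    # a permutation counts if no element sits at its own index.
    return all(v != i for i, v in enumerate(p))

def map_task(first, n):
    # One "map task": brute-force all permutations beginning with `first`,
    # mirroring how the search space is partitioned across workers.
    rest = [x for x in range(n) if x != first]
    return sum(1 for tail in permutations(rest) if is_solution((first, *tail)))

def count_solutions(n):
    # "Reduce" step: sum the per-task counts.
    return sum(map_task(first, n) for first in range(n))

print(count_solutions(5))  # number of derangements of 5 elements: 44
```

The same prefix-based partitioning scales to 15 elements by handing each prefix to a separate Hadoop map task or Spark RDD partition.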
P. Škoda, D. Bernhauer, M. Nečaský, Jakub Klímek, T. Skopal
Many institutions publish datasets as Open Data in catalogs; however, retrieving these datasets remains problematic due to the absence of benchmarks for dataset search. We propose a framework for evaluating the findability of datasets regardless of the retrieval model used. As task-agnostic ground-truth labeling of datasets turns out to be infeasible in the general domain of open data, the proposed framework is based on evaluating entire retrieval scenarios that mimic complex retrieval tasks. In addition to the framework, we present a proof-of-concept specification and an evaluation of several similarity-based retrieval models on several dataset discovery scenarios within a catalog, using our experimental evaluation tool. Instead of the traditional matching of a query against the metadata of all datasets, similarity-based retrieval formulates the query as a set of datasets (query by example) and retrieves the datasets most similar to the query set from the catalog.
"Evaluation Framework for Search Methods Focused on Dataset Findability in Open Data Catalogs." P. Škoda, D. Bernhauer, M. Nečaský, Jakub Klímek, T. Skopal. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429973
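The query-by-example retrieval described above can be sketched as follows, assuming each catalog dataset is already represented by a feature vector (the toy catalog, its names, and the cosine scoring are ours, not the paper's):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def query_by_example(catalog, query_ids, k=3):
    # Rank every non-query dataset by its mean similarity to the query set.
    scores = {
        ds: sum(cosine(vec, catalog[q]) for q in query_ids) / len(query_ids)
        for ds, vec in catalog.items() if ds not in query_ids
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical two-feature catalog for illustration.
catalog = {
    "census-2020": (1.0, 0.0),
    "census-2019": (0.9, 0.1),
    "bus-routes": (0.0, 1.0),
}
print(query_by_example(catalog, {"census-2020"}, k=2))
```

A retrieval scenario in the paper's sense would then judge whether the datasets a user needs surface near the top of such a ranking.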
Recommender systems are now part of our daily lives more than ever, and most users are confronted with some form of recommendation every day. As users of such systems, we do not need to actively seek out new content; instead, it is comfortably recommended to us. One important part of our lives not yet covered in this way is the domain of cooking. A traditional dilemma of a person shopping for food is: "What else should I buy so that I can cook something new?" In other words, the person either has to look for novel recipes upfront (which need not correspond to the ingredients available in the shop) or buy ingredients intuitively (which need not correspond to any recipe). The main objective of this paper is to bind the cooking and shopping activities together via a mobile recipe-recommendation application. The application responds to the content of a user's shopping list and strives for calibration of the recommended recipes. In an online user study, we also show that calibrated recommendations outperform both diversity-enhanced and plain similarity-based recommendations.
"SmartRecepies." Josef Starychfojtu, Ladislav Peška. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429096
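The shopping-list-driven idea can be sketched with a hypothetical scoring rule (ours, not the paper's calibration method): rank recipes by how much of their ingredient list the current shopping list already covers, breaking ties in favor of fewer missing ingredients:

```python
def rank_recipes(shopping_list, recipes, k=2):
    # recipes maps a recipe name to its ingredient list.
    basket = set(shopping_list)
    def score(ingredients):
        ing = set(ingredients)
        # (covered fraction, negated count of still-missing ingredients)
        return (len(ing & basket) / len(ing), -len(ing - basket))
    return sorted(recipes, key=lambda name: score(recipes[name]), reverse=True)[:k]

# Toy data for illustration.
recipes = {
    "tomato pasta": ["pasta", "tomato"],
    "garden salad": ["lettuce", "tomato", "cucumber"],
    "plain toast": ["bread"],
}
print(rank_recipes(["pasta", "tomato"], recipes, k=1))
```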
Reviews are used to generate explainable recommendations; however, how best to use them has so far not been adequately addressed. In this paper, we examine methods that make effective use of reviews. There is a trade-off between the number and the quality of the reviews used: we would like to use as many reviews as possible to generate explainable recommendations, yet a large set of reviews may contain low-quality ones, which can degrade the quality of the generated explanations. We discuss new methods that use not only the reviews written by a user but also those utilized by the user to generate good explainable recommendations. Our methods can be applied to different explainable recommender approaches, which we demonstrate by adopting two state-of-the-art explainable recommenders in this paper. Experimental results show that our methods benefit existing explainable recommender approaches in terms of both recommendation quality and explanation quality.
"Making Use of Reviews for Good Explainable Recommendation." Shunsuke Kido, Ryuji Sakamoto, M. Aritsugi. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429125
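The number-versus-quality trade-off above can be made concrete with a toy filter: raising a (hypothetical) quality threshold shrinks the review pool while raising its average quality. The scores and threshold are illustrative only:

```python
def select_reviews(reviews, min_quality):
    # Keep only reviews at or above the quality threshold and report
    # how many survive and their average quality: the two sides of the
    # trade-off the paper describes.
    kept = [r for r in reviews if r["quality"] >= min_quality]
    avg = sum(r["quality"] for r in kept) / len(kept) if kept else 0.0
    return len(kept), avg

reviews = [{"quality": 0.9}, {"quality": 0.8}, {"quality": 0.3}]
print(select_reviews(reviews, 0.5))  # fewer reviews, higher average quality
```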
In this paper, we focus on the problems of fairly aggregating recommender systems (RS) and of over-exposing users to insignificant recommendations. While fair aggregation of diverse RS may contribute to both the calibration and the diversity challenge, some recently proposed methods suffer from repeating the same set of recommendations to the user over and over again. However, it may be difficult to distinguish situations where users ignore recommendations because they are irrelevant from those where users simply did not notice them. To cope with these challenges, we propose an innovative off-line RS evaluation methodology based on the noticeability of recommended items. We further propose a Fuzzy D'Hondt's algorithm with personalized implicit negative feedback attribution (FDHondtINF). The algorithm is designed to provide a fair ordering of candidate items coming from multiple individual RS while also considering the items previously ignored by the current user. FDHondtINF was evaluated off-line against other aggregation methods and individual RS on the MovieLens 1M dataset. The algorithm performs especially well when the recommended items are less noticeable, or when a sequence of multiple recommendations is generated for the same user model.
"Personalized Implicit Negative Feedback Enhancements for Fuzzy D'Hondt's Recommendation Aggregations." Stepán Balcar, Ladislav Peška. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429105
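The D'Hondt mechanism underlying the aggregation above can be illustrated with its classic, non-fuzzy form: each base recommender acts as a "party" whose support repeatedly competes for the next recommendation slot. This is a sketch of the selection principle only, not the paper's FDHondtINF algorithm:

```python
def dhondt_slots(support, n_slots):
    # Classic D'Hondt apportionment: in each round, the recommender
    # ("party") with the highest quotient support / (slots_won + 1)
    # wins the next slot, yielding a proportionally fair ordering.
    won = {r: 0 for r in support}
    order = []
    for _ in range(n_slots):
        best = max(support, key=lambda r: support[r] / (won[r] + 1))
        won[best] += 1
        order.append(best)
    return order

# Hypothetical per-recommender support scores.
print(dhondt_slots({"A": 100, "B": 60, "C": 40}, 5))  # ['A', 'B', 'A', 'C', 'A']
```

In the fuzzy variant, items may be backed by several recommenders at once; the quotient idea above is what makes the final list proportional to each recommender's support.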
Federations of RDF data sources provide great potential when queried for answers and insights that cannot be obtained from any one data source alone. A challenge for planning the execution of queries over such a federation is that the federation may be heterogeneous in the types of data access interfaces its members provide. This challenge has not received much attention in the literature. This paper provides a solid formal foundation for future approaches that aim to address it. Our main conceptual contribution is a formal language for representing query execution plans; additionally, we identify a fragment of this language that can be used to capture the result of selecting relevant data sources for different parts of a given query. As technical contributions, we show that this fragment is more expressive than what existing source selection approaches support, which highlights an inherent limitation of these approaches. Moreover, we show that the source selection problem is NP-hard and in Σ2P (the second level of the polynomial hierarchy), and we provide an extensive set of rewriting rules that can serve as a basis for query optimization.
"FedQPL: A Language for Logical Query Plans over Heterogeneous Federations of RDF Data Sources." Sijin Cheng, O. Hartig. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-10-02. DOI: https://doi.org/10.1145/3428757.3429120
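A logical plan over a heterogeneous federation can be pictured as a small algebra of operators. The sketch below uses our own illustrative operator names (Req, Join, Union), not the paper's actual FedQPL operators, and shows how source selection pins a federation member to each leaf request:

```python
from dataclasses import dataclass

@dataclass
class Req:
    # A leaf: request one triple pattern from one federation member.
    pattern: str
    member: str

@dataclass
class Join:
    left: object
    right: object

@dataclass
class Union:
    left: object
    right: object

def sources(plan):
    # Collect the set of federation members a plan touches.
    if isinstance(plan, Req):
        return {plan.member}
    return sources(plan.left) | sources(plan.right)

# Hypothetical plan: a join whose right side unions two member-specific requests.
plan = Join(Req("?s :hasAuthor ?a", "endpoint-1"),
            Union(Req("?a :name ?n", "endpoint-2"),
                  Req("?a :name ?n", "endpoint-1")))
print(sources(plan))
```

Rewriting rules of the kind the paper proposes would transform such trees (e.g., pushing joins toward members that can evaluate them) without changing the answer.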
Soumya Suvra Ghosal, P. Deepak, Anna Jurek-Loughrey
Disinformation is often presented in long textual articles, especially in domains such as health, as has frequently been observed in relation to COVID-19. Such articles typically contain a number of trustworthy sentences among which the core disinformation sentences are scattered. In this paper, we propose a novel unsupervised task: identifying the sentences that carry the key disinformation within a document already known to be untrustworthy. We design a three-phase statistical NLP solution for the task, which starts by embedding sentences in a bespoke feature space designed for the task. The sentence representations are then clustered, after which the key sentences are identified through proximity scoring. We also curate a new dataset with sentence-level disinformation scores to aid evaluation for this task; the dataset is being made publicly available to facilitate further research. Based on a comprehensive empirical evaluation against techniques from related tasks such as claim detection and summarization, as well as against simplified variants of our proposed approach, we show that our method identifies core disinformation effectively.
"ReSCo-CC: Unsupervised Identification of Key Disinformation Sentences." Soumya Suvra Ghosal, P. Deepak, Anna Jurek-Loughrey. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, 2020-10-01. DOI: https://doi.org/10.1145/3428757.3429107
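The three-phase pipeline (embed, cluster, proximity-score) can be caricatured in a few lines. The sketch below collapses it to word-set features and an average-dissimilarity outlier score, a simplification of, not a substitute for, the paper's bespoke features and clustering:

```python
def jaccard(a, b):
    # Word-set similarity between two sentences.
    inter, union = len(a & b), len(a | b)
    return inter / union if union else 0.0

def key_sentences(sentences, k=1):
    # Embed each sentence as its lowercase word set, then score it by
    # average dissimilarity to the other sentences; the top-k outliers
    # are returned as candidate key-disinformation sentences.
    sets = [set(s.lower().split()) for s in sentences]
    def score(i):
        others = [1.0 - jaccard(sets[i], sets[j])
                  for j in range(len(sets)) if j != i]
        return sum(others) / len(others)
    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in ranked[:k]]

# Toy document: two mundane sentences and one planted outlier.
doc = [
    "The vaccine is safe and effective",
    "The vaccine is safe for adults",
    "Aliens control the vaccine supply",
]
print(key_sentences(doc, k=1))
```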