N. Derbas, F. Segond, Muntsa Padró, Emmanuelle Dusserre, Teodora Dobre, S. Monaci, Gustavo Mastrobuoni
The Internet and social media are widely used by terrorist organizations to spread their ideas and recruit foreign fighters. The aim of the SAFFRON project is to build a system that supports early detection of the recruitment of foreign fighters by terrorist groups in Europe. The project studies recruitment communication strategies on social media (e.g. narratives, argumentative tropes and myths) and their evolution over time, and identifies the needs, values, and cultural and social contexts of the target groups (young foreign fighters). In this paper, we describe Safapp, the application developed to support semantic analysis of social networks. We focus on how SAFFRON uses natural language processing and machine learning to categorize and analyse messages dealing with recruitment and radicalization on social networks.
{"title":"Semantic Analysis Supporting De-Radicalisation","authors":"N. Derbas, F. Segond, Muntsa Padró, Emmanuelle Dusserre, Teodora Dobre, S. Monaci, Gustavo Mastrobuoni","doi":"10.1109/DEXA.2017.20","DOIUrl":"https://doi.org/10.1109/DEXA.2017.20","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
Protein-protein interactions (PPIs) occur at every level of cell function. The identification of protein interactions provides a global picture of cellular functions and biological processes. It is also an essential step in the construction of PPI networks for humans and other organisms. PPI prediction has been considered a promising alternative to traditional drug design techniques. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of new drugs and the optimization of treatments. Several physicochemical experimental techniques have been applied to identify PPIs. However, these techniques are computationally expensive and significantly time-consuming, and they have covered only a small portion of the complete PPI networks. As a result, the need for computational techniques to validate experimental results and to predict undiscovered PPIs has increased. This paper investigates and compares recent computational PPI prediction approaches and discusses the technical challenges in this domain.
{"title":"Protein-Protein Interaction Prediction: Recent Advances","authors":"M. Shatnawi","doi":"10.1109/DEXA.2017.30","DOIUrl":"https://doi.org/10.1109/DEXA.2017.30","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac
In many biological studies, statistical and data mining methods are extensively used to analyze data and discover actionable knowledge. However, poor data quality can cause incorrect analysis results and wrong interpretations, which in turn lead to misleading conclusions and inadequate decisions. To ensure the validity of the results and to avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline but, most importantly, the quality of the data through appropriate data preprocessing choices. Since different preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method for selecting the optimal set of data preprocessing steps. This selection depends both on the distributional characteristics of the input data and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that, given a dataset, selects the optimal set of preprocessing tasks to apply so that the preprocessed output maximizes the quality of the analytical results for various clustering, regression, and classification techniques. We present some promising results that validate our approach on biomonitoring data preparation.
{"title":"Principled Data Preprocessing: Application to Biological Aquatic Indicators of Water Pollution","authors":"Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac","doi":"10.1109/DEXA.2017.27","DOIUrl":"https://doi.org/10.1109/DEXA.2017.27","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
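The idea of choosing a set of preprocessing tasks that maximizes the quality of a downstream analysis can be illustrated with a small search. The abstract does not say how the selection is actually performed, so the following is only a minimal sketch under stated assumptions: exhaustive search over subsets of three hypothetical steps (`impute_median`, `clip_outliers`, `min_max_scale`), applied in a fixed order, with a placeholder quality function `toy_score` standing in for the quality of a real clustering, regression, or classification result.

```python
from itertools import combinations
from statistics import median

# Hypothetical preprocessing steps; missing values are represented by None.
def impute_median(xs):
    known = [x for x in xs if x is not None]
    if not known:
        return list(xs)
    m = median(known)
    return [m if x is None else x for x in xs]

def clip_outliers(xs, low=0.0, high=100.0):
    return [None if x is None else min(max(x, low), high) for x in xs]

def min_max_scale(xs):
    known = [x for x in xs if x is not None]
    if not known or max(known) == min(known):
        return [None if x is None else 0.0 for x in xs]
    lo, hi = min(known), max(known)
    return [None if x is None else (x - lo) / (hi - lo) for x in xs]

STEPS = [("impute", impute_median), ("clip", clip_outliers), ("scale", min_max_scale)]

def toy_score(xs):
    """Placeholder quality measure: penalize missing or out-of-range values."""
    return -sum(1 for x in xs if x is None or not 0.0 <= x <= 1.0)

def select_pipeline(data, score, steps=STEPS):
    """Try every subset of steps (in the listed order) and keep the best-scoring one."""
    best = None
    for r in range(len(steps) + 1):
        for combo in combinations(steps, r):
            out = data
            for _, fn in combo:
                out = fn(out)
            s = score(out)
            # Strict comparison keeps the first (smallest) subset among ties.
            if best is None or s > best[0]:
                best = (s, [name for name, _ in combo])
    return best
```

In a realistic setting `score` would run the target analysis on the preprocessed data and return its quality metric, and the exhaustive search would likely be replaced by something cheaper once the number of candidate steps grows.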
The Oxford Nanopore and PacBio SMRT sequencing technologies have revolutionized the Next-Generation Sequencing (NGS) environment by producing long reads that exceed 60 kbp, and they have contributed to the completion of many biological projects. However, long reads are characterized by a high error rate, which increases the difficulty of biological problems such as genome assembly. Error correction of long reads has become a challenge for bioinformaticians and motivates the development of new error correction approaches adapted to NGS technologies. In this paper, we present a new de novo self-error correction algorithm that uses only long reads. Our algorithm operates in two steps. First, we use a fast hashing method to find alignments between the longest reads and the other reads in a set of long reads. Next, we use the longest reads as seeds to obtain the final alignment of long reads by using a dynamic programming algorithm in a band of width w. Unlike existing hybrid error correction algorithms, ours does not require high-quality reads.
{"title":"An Error Correction Algorithm for NGS Data","authors":"M. Kchouk, J. Gibrat, M. Elloumi","doi":"10.1109/DEXA.2017.33","DOIUrl":"https://doi.org/10.1109/DEXA.2017.33","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
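The two-step scheme in the abstract above (hash-based candidate search, then dynamic programming restricted to a band of width w) can be sketched as follows. The abstract does not specify the hashing method or the scoring scheme, so this sketch assumes exact shared k-mers for the first step and unit-cost edit distance for the second; the function names and the `min_shared` threshold are illustrative, not the authors'.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map each k-mer to the set of ids of the reads that contain it."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index

def candidate_reads(seed_id, reads, index, k, min_shared=2):
    """Ids of reads sharing at least min_shared distinct k-mers with the seed."""
    seed = reads[seed_id]
    counts = defaultdict(int)
    for kmer in {seed[i:i + k] for i in range(len(seed) - k + 1)}:
        for rid in index.get(kmer, ()):
            if rid != seed_id:
                counts[rid] += 1
    return sorted(rid for rid, c in counts.items() if c >= min_shared)

def banded_edit_distance(a, b, w):
    """Edit distance using only cells with |i - j| <= w.

    Returns None when the length difference already exceeds the band.
    """
    if abs(len(a) - len(b)) > w:
        return None
    inf = float("inf")
    prev = {j: j for j in range(min(len(b), w) + 1)}  # row i = 0
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - w), min(len(b), i + w) + 1):
            best = inf
            if j == 0:
                best = i  # delete the first i characters of a
            else:
                if (j - 1) in prev:  # match or substitution
                    best = min(best, prev[j - 1] + (a[i - 1] != b[j - 1]))
                if (j - 1) in cur:   # insertion
                    best = min(best, cur[j - 1] + 1)
            if j in prev:            # deletion
                best = min(best, prev[j] + 1)
            cur[j] = best
        prev = cur
    return prev[len(b)]
```

Restricting the table to the band reduces the work per read pair from O(nm) to roughly O(n·w), which is the usual motivation for banding when the reads are tens of kilobases long.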
Acquiring stories and narratives about past periods is a challenge for cultural heritage preservation. In this context, we present a method to obtain from the web a corpus of texts related to the period 1945-1975 in Luxembourg. The extracted texts are accompanied by metadata that facilitate their integration into third-party applications. As a use case, this corpus will be used in software that aims to help elderly people recall and share anecdotal stories about this period.
{"title":"A Corpus of Narratives Related to Luxembourg for the Period 1945-1975","authors":"O. Parisot, T. Tamisier","doi":"10.1109/DEXA.2017.39","DOIUrl":"https://doi.org/10.1109/DEXA.2017.39","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
In this study, we try to identify extreme adopters on a discussion forum using machine learning. An extreme adopter is a user who has adopted a high level of community-specific jargon and can therefore be seen as a user with a high degree of identification with the community. The dataset that we consider comes from a Swedish xenophobic discussion forum, where we use a machine learning approach to identify extreme adopters using a number of linguistic features that are independent of the dataset and the community. The results indicate that it is possible to separate these extreme adopters from the rest of the discussants on the forum with more than 80% accuracy. Since the linguistic features that we use are highly domain independent, the results indicate that this kind of technique could be used to identify extreme adopters within other communities as well.
{"title":"A Machine Learning Approach towards Detecting Extreme Adopters in Digital Communities","authors":"A. Shrestha, Lisa Kaati, Katie Cohen","doi":"10.1109/DEXA.2017.17","DOIUrl":"https://doi.org/10.1109/DEXA.2017.17","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
The ability to accurately detect opinion expressions in a document is an essential and fundamental task in opinion mining. In this work, we treat opinion expression detection as a sequence labeling task. We describe deep neural network frameworks that consist of convolutional neural networks (CNNs) and bidirectional gated recurrent units (Bi-GRUs). CNNs are capable of capturing local features in a sequence, while Bi-GRUs, a variant of recurrent neural networks (RNNs), are able to extract features from sequence data. The properties of these two networks allow the framework to detect opinion expressions effectively. Experimental results show that our methods significantly outperform traditional methods such as conditional random fields (CRFs) and previous state-of-the-art deep RNN methods.
{"title":"Opinion Expression Detection via Deep Bidirectional C-GRUs","authors":"Xiaoxia Xie","doi":"10.1109/DEXA.2017.40","DOIUrl":"https://doi.org/10.1109/DEXA.2017.40","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
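Framing opinion expression detection as sequence labeling means that a model assigns one tag per token and the expressions are then read off as contiguous tagged spans. The abstract above does not state the tag scheme, so the sketch below assumes a plain BIO scheme (B = begins an expression, I = inside, O = outside); it shows only the decoding of predicted tags into spans, not the CNN or Bi-GRU model itself, and the example sentence is invented.

```python
def extract_spans(tokens, tags):
    """Collect (start, end, text) spans from BIO tags; end is exclusive.

    An "I" tag with no open span is treated leniently as the start of a new one.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:  # a new "B" closes the span that was open
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:  # span running to the end of the sentence
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans

tokens = ["The", "committee", "strongly", "opposed", "the", "plan", "and", "said", "so"]
tags = ["O", "O", "B", "I", "O", "O", "O", "B", "O"]
spans = extract_spans(tokens, tags)
```

Here `spans` holds two expressions, "strongly opposed" (tokens 2 to 3) and "said" (token 7), which is the form in which span-level precision and recall are typically computed.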
C. Ordonez, T. Johnson, D. Srivastava, Simon Urbanek
Due to advances in parallel file systems for big data (e.g. HDFS) and larger-capacity hardware (multicore CPUs, large RAM), it is now feasible to manage and query network data in a parallel DBMS supporting SQL, but performing statistical analysis remains a challenge. On the statistics side, the R language is popular, but it has important limitations: R is limited by main memory, R works in a different address space from query processing, R cannot analyze large disk-resident data sets efficiently, and R has no data management capabilities. Moreover, some R libraries allow R to work in parallel, but without data management capabilities. Considering these challenges and limitations, we present a system that combines SQL queries and R functions in a seamless manner. We argue that a parallel DBMS and the R runtime are two different systems that benefit from low-level integration. Our parallel DBMS is built on top of HDFS and programmed in Java and C++, with a flexible scale-out architecture, whereas R is programmed purely in C. The user or developer can make calls in both directions: (1) R calling SQL, to evaluate analytic queries or retrieve data from materialized views (transferring result tables in RAM in a streaming fashion and analyzing them in R), and (2) SQL calling R, allowing SQL to convert relational tables to matrices or vectors and perform complex computations on them. We give a summary of network monitoring tasks at AT&T and present specific programming examples showing language calls in both directions (i.e. R calls SQL, SQL calls R).
{"title":"A Tool for Statistical Analysis on Network Big Data","authors":"C. Ordonez, T. Johnson, D. Srivastava, Simon Urbanek","doi":"10.1109/DEXA.2017.23","DOIUrl":"https://doi.org/10.1109/DEXA.2017.23","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
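The two call directions described above can be illustrated by analogy. The system in the abstract couples R with a parallel DBMS on HDFS; the sketch below is not that system and uses none of its interfaces. It substitutes Python for R and an in-memory SQLite database for the DBMS, purely to show the pattern: the host language streaming query results, and SQL invoking a function implemented in the host language. The `flows` table, its columns, and the `host_stddev` aggregate are hypothetical.

```python
import sqlite3
import statistics

class StdDev:
    """Aggregate callable from SQL; the computation itself runs in the host language."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.pstdev(self.values) if self.values else None

def build_db():
    """Create a toy in-memory table of per-router traffic volumes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flows (router TEXT, bytes REAL)")
    conn.executemany(
        "INSERT INTO flows VALUES (?, ?)",
        [("r1", 10.0), ("r1", 30.0), ("r2", 5.0), ("r2", 5.0)],
    )
    # Register the host-language aggregate so SQL statements can call it.
    conn.create_aggregate("host_stddev", 1, StdDev)
    return conn

def mean_bytes_per_router(conn):
    """Direction 1: host language calls SQL and consumes rows one at a time."""
    query = "SELECT router, AVG(bytes) FROM flows GROUP BY router ORDER BY router"
    result = {}
    for router, avg in conn.execute(query):
        result[router] = avg
    return result

def stddev_per_router(conn):
    """Direction 2: SQL calls the host language through the registered aggregate."""
    query = "SELECT router, host_stddev(bytes) FROM flows GROUP BY router ORDER BY router"
    return dict(conn.execute(query))
```

With the toy data, `mean_bytes_per_router(build_db())` gives 20.0 for `r1` and 5.0 for `r2`, and `stddev_per_router` gives 10.0 and 0.0. Unlike the low-level integration the abstract argues for, this analogy runs in a single process and says nothing about parallelism or data transfer cost.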
This short empirical paper investigates how well topic modeling and database metadata characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, the empirical experiment obtains an accuracy rate of nearly 0.9. Text mining and topic modeling contribute significantly to this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, and it provides a few scholarly observations about the potential for semi-automatic classification of exploits in existing tracking infrastructures.
{"title":"Classifying Web Exploits with Topic Modeling","authors":"Jukka Ruohonen","doi":"10.1109/DEXA.2017.35","DOIUrl":"https://doi.org/10.1109/DEXA.2017.35","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}
Knowledge processing systems have recently regained attention in the context of big "knowledge" processing and cloud platforms. The development of such systems with high software quality therefore has to be ensured. This paper presents an approach that contributes to an architectural guideline for developing such systems using the concept of design patterns. The need for such a guideline and current research in this domain are presented. Further, possible design pattern candidates extracted from the literature are introduced.
{"title":"Introducing Design Patterns to Knowledge Processing Systems in the Context of Big Data and Cloud Platforms","authors":"Stefan Nadschläger","doi":"10.1109/DEXA.2017.26","DOIUrl":"https://doi.org/10.1109/DEXA.2017.26","journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)"},"publicationDate":"2017-08-01"}