2017 28th International Workshop on Database and Expert Systems Applications (DEXA): Latest Publications

Semantic Analysis Supporting De-Radicalisation

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.20

N. Derbas, F. Segond, Muntsa Padró, Emmanuelle Dusserre, Teodora Dobre, S. Monaci, Gustavo Mastrobuoni

The Internet and social media are widely used by terrorist organizations to spread their ideas and recruit foreign fighters. The aim of the SAFFRON project is to build a system that supports early detection of the recruitment of foreign fighters by terrorist groups in Europe. The project studies recruitment communication strategies on social media (e.g. narratives, argumentative tropes and myths) and their evolution over time, and identifies the needs, values, and cultural and social contexts of the target groups (young foreign fighters). In this paper, we describe Safapp, the application developed to support semantic analysis of social networks. We focus on how SAFFRON uses natural language processing and machine learning to categorize and analyse messages dealing with recruitment and radicalization on social networks.

Citations: 4

Protein-Protein Interaction Prediction: Recent Advances

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.30

M. Shatnawi

Protein-protein interactions (PPIs) occur at every level of cell function. The identification of protein interactions provides a global picture of cellular functions and biological processes. It is also an essential step in the construction of PPI networks for humans and other organisms. PPI prediction has been considered a promising alternative to traditional drug design techniques. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of several drugs and to treatment optimization. Several physicochemical experimental techniques have been applied to identify PPIs. However, these techniques are computationally expensive, significantly time consuming, and have covered only a small portion of the complete PPI networks. As a result, the need for computational techniques to validate experimental results and to predict undiscovered PPIs has increased. This paper investigates and compares recent computational PPI prediction approaches and discusses the technical challenges in this domain.

Citations: 2

Principled Data Preprocessing: Application to Biological Aquatic Indicators of Water Pollution

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.27

Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac

In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. However, poor data quality, which causes incorrect analysis results and wrong interpretations, may lead to misleading conclusions and inadequate decisions. To ensure the validity of the results and to avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline but, most importantly, the quality of the data through appropriate data preprocessing choices. Since different preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps, which depends both on the distributional characteristics of the input data and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that, given a dataset, selects the optimal set of preprocessing tasks to apply to the data such that the overall preprocessing output maximizes the quality of the analytical results for various clustering, regression, and classification techniques. We present some promising results that validate our approach on biomonitoring data preparation.

Citations: 7

An Error Correction Algorithm for NGS Data

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.33

M. Kchouk, J. Gibrat, M. Elloumi

The Oxford Nanopore and PacBio SMRT sequencing technologies have revolutionized the Next-Generation Sequencing (NGS) environment by producing long reads that exceed 60 kbp, and have helped complete many biological projects. However, long reads are characterized by a high error rate, which increases the difficulty of biological problems such as genome assembly. Error correction of long reads has become a challenge for bioinformaticians, which motivates the development of new error correction approaches adapted to NGS technologies. In this paper, we present a new de novo self-error-correction algorithm that uses only long reads. Our algorithm operates in two steps. First, we use a fast hashing method to find alignments between the longest reads and the other reads in a set of long reads. Next, we use the longest reads as seeds to obtain the final alignment of the long reads by using a dynamic programming algorithm in a band of width w. In contrast to existing hybrid error correction algorithms, our algorithm does not require high-quality reads.
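
The two steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hashing scheme or the scoring function, so the sketch assumes a plain k-mer index for the first step and unit-cost edit distance for the banded dynamic programming step. The function names (`kmer_index`, `candidate_overlaps`, `banded_edit_distance`) and the parameter `k` are hypothetical; only the band width `w` comes from the abstract.

```python
from collections import defaultdict


def kmer_index(reads, k):
    """Map each k-mer to the set of indices of the reads that contain it."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    return index


def candidate_overlaps(reads, seed_idx, k):
    """Step 1 (assumed form): count distinct k-mers shared with the seed read."""
    index = kmer_index(reads, k)
    seed = reads[seed_idx]
    shared = defaultdict(int)
    for kmer in {seed[p:p + k] for p in range(len(seed) - k + 1)}:
        for j in index[kmer]:
            if j != seed_idx:
                shared[j] += 1
    return dict(shared)


def banded_edit_distance(a, b, w):
    """Step 2 (assumed form): edit distance using only cells with |i - j| <= w.

    Returns None when the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > w:
        return None
    inf = float("inf")
    # Row 0: reaching column j costs j insertions, if j is inside the band.
    prev = [j if j <= w else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)  # full-width row kept for clarity
        for j in range(max(0, i - w), min(m, i + w) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]
```

In this sketch, reads that share many k-mers with a long seed read would be passed to the banded step, which ignores every cell of the alignment matrix farther than `w` from the main diagonal. The result equals the unrestricted edit distance whenever an optimal alignment stays inside the band.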

Citations: 1

A Corpus of Narratives Related to Luxembourg for the Period 1945-1975

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.39

O. Parisot, T. Tamisier

Acquiring stories and narratives about past periods is a challenge for cultural heritage preservation. In this context, we present a method to obtain from the web a corpus of texts related to the period 1945-1975 in Luxembourg. Extracted texts are accompanied by metadata that facilitates their integration by third-party applications. As a use case, this corpus will be used in software that aims to help elderly people recall and share anecdotal stories about this period.

Citations: 1

A Machine Learning Approach towards Detecting Extreme Adopters in Digital Communities

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.17

A. Shrestha, Lisa Kaati, Katie Cohen

In this study we try to identify extreme adopters on a discussion forum using machine learning. An extreme adopter is a user who has adopted a high level of community-specific jargon and can therefore be seen as a user with a high degree of identification with the community. The dataset that we consider comes from a Swedish xenophobic discussion forum, where we use a machine learning approach to identify extreme adopters using a number of linguistic features that are independent of the dataset and the community. The results indicate that it is possible to separate these extreme adopters from the rest of the discussants on the forum with more than 80% accuracy. Since the linguistic features that we use are highly domain independent, the results indicate that this kind of technique could also be used to identify extreme adopters within other communities.

Citations: 8

A Tool for Statistical Analysis on Network Big Data

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.23

C. Ordonez, T. Johnson, D. Srivastava, Simon Urbanek

Due to advances in parallel file systems for big data (e.g. HDFS) and larger-capacity hardware (multicore CPUs, large RAM), it is now feasible to manage and query network data in a parallel DBMS supporting SQL, but performing statistical analysis remains a challenge. On the statistics side, the R language is popular, but it has important limitations: R is limited by main memory, R works in a different address space from query processing, R cannot analyze large disk-resident data sets efficiently, and R has no data management capabilities. Moreover, some R libraries allow R to work in parallel, but without data management capabilities. Considering these challenges and limitations, we present a system that combines SQL queries and R functions in a seamless manner. We argue that a parallel DBMS and the R runtime are two different systems that benefit from a low-level integration. Our parallel DBMS is built on top of HDFS and programmed in Java and C++, with a flexible scale-out architecture, whereas R is programmed purely in C. The user or developer can make calls in both directions: (1) R calling SQL, to evaluate analytic queries or retrieve data from materialized views (transferring result tables in RAM in a streaming fashion and analyzing them in R), and (2) SQL calling R, allowing SQL to convert relational tables to matrices or vectors and perform complex computations on them. We give a summary of network monitoring tasks at AT&T and present specific programming examples showing language calls in both directions (i.e. R calls SQL, SQL calls R).
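
The two call directions can be illustrated with a small analogy in Python. This sketch does not use the system described in the abstract: Python stands in for R, and an in-memory SQLite database stands in for the parallel DBMS. The table `traffic`, its columns, and the helper names are hypothetical and exist only for illustration.

```python
import sqlite3
import statistics


def build_demo_db():
    """Create a tiny in-memory table of per-router traffic samples (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE traffic (router TEXT, minute INTEGER, bytes REAL)")
    rows = [
        ("r1", 0, 100.0), ("r1", 1, 120.0), ("r1", 2, 110.0),
        ("r2", 0, 300.0), ("r2", 1, 330.0), ("r2", 2, 360.0),
    ]
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)
    return conn


class StdevAgg:
    """Host-language aggregate exposed to SQL: sample standard deviation."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.stdev(self.values) if len(self.values) > 1 else None


def host_calls_sql(conn):
    """Direction 1: the host language runs a query and streams rows into a matrix."""
    cursor = conn.execute(
        "SELECT minute, bytes FROM traffic WHERE router = 'r1' ORDER BY minute"
    )
    return [list(row) for row in cursor]  # list of rows, used as a small matrix


def sql_calls_host(conn):
    """Direction 2: a SQL query invokes a function defined in the host language."""
    conn.create_aggregate("stdev", 1, StdevAgg)
    cursor = conn.execute(
        "SELECT router, stdev(bytes) FROM traffic GROUP BY router ORDER BY router"
    )
    return dict(cursor.fetchall())
```

Calling `host_calls_sql(build_demo_db())` pulls a result table into the analysis language, while `sql_calls_host` lets the query engine hand column values to a statistical routine. Unlike the system in the abstract, SQLite runs in the same process as the host language, so the sketch does not reproduce the cross-address-space transfer the authors address.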

Citations: 2

Opinion Expression Detection via Deep Bidirectional C-GRUs

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.40

Xiaoxia Xie

Accurately detecting opinion expressions in a document is an essential and fundamental task in opinion mining. In this work, we treat opinion expression detection as a sequence labeling task. We describe deep neural network frameworks that consist of convolutional neural networks (CNNs) and bidirectional gated recurrent units (Bi-GRUs). CNNs are capable of capturing local features in a sequence, while Bi-GRUs, a variant of recurrent neural networks (RNNs), are able to extract features from sequence data. The properties of these two networks allow the framework to detect opinion expressions effectively. Experimental results show that our methods significantly outperform traditional methods such as conditional random fields (CRFs) and previous state-of-the-art deep RNN methods.

Citations: 4

Classifying Web Exploits with Topic Modeling

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.35

Jukka Ruohonen

This short empirical paper investigates how well topic modeling and database metadata characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, the empirical experiment obtains an accuracy rate of nearly 0.9. Text mining and topic modeling contribute significantly to this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, and offers a few scholarly observations about the potential for semi-automatic classification of exploits in existing tracking infrastructures.

Citations: 13

Introducing Design Patterns to Knowledge Processing Systems in the Context of Big Data and Cloud Platforms

2017 28th International Workshop on Database and Expert Systems Applications (DEXA)

Pub Date : 2017-08-01 DOI: 10.1109/DEXA.2017.26

Stefan Nadschläger

Knowledge processing systems have recently regained attention in the context of big "knowledge" processing and cloud platforms. The development of such systems with high software quality therefore has to be ensured. This paper presents an approach that contributes to an architectural guideline for developing such systems using the concept of design patterns. The need for such a guideline, as well as current research in this domain, is presented. Further, possible design pattern candidates extracted from the literature are introduced.

Citations: 0
