Journal of Information and Data Management: Latest Articles


Capturing Provenance from Deep Learning Applications Using Keras-Prov and Colab: a Practical Approach

Journal of Information and Data Management

Pub Date : 2022-12-19 DOI: 10.5753/jidm.2022.2544

Débora Pina, L. Kunstmann, Felipe Bevilaqua, Isabela Siqueira, Alan Lyra, Daniel de Oliveira, M. Mattoso

Due to the exploratory nature of DNNs, DL specialists often need to modify the input dataset, change a filter when preprocessing input data, or fine-tune the models’ hyperparameters while analyzing the evolution of the training. However, the specialist may lose track of which hyperparameter configurations have been used and tuned if these data are not properly registered. Thus, these configurations must be tracked and made available for the user’s analysis. One way of doing this is to use provenance data derivation traces to support hyperparameter fine-tuning by providing a global data picture with clear dependencies. Current provenance solutions present provenance data disconnected from the W3C PROV recommendation, which makes the data difficult to reproduce and to compare with other provenance data. To help with these challenges, we present Keras-Prov, an extension to the Keras deep learning library that collects provenance data compliant with PROV. To show the flexibility of Keras-Prov, we extend a previous Keras-Prov demonstration paper with larger experiments using GPUs with the help of Google Colab. Despite the challenges of running a DBMS with virtual environments, DL analysis with provenance has added trust and persistence in databases and PROV serializations. Experiments show Keras-Prov data analysis during training execution supporting hyperparameter fine-tuning decisions and favoring the comparison and reproducibility of such DL experiments. Keras-Prov is open source and can be downloaded from https://github.com/dbpina/keras-prov.


Citations: 0

Consistent Design of Relational Databases using EERCASE

Journal of Information and Data Management

Pub Date : 2022-12-19 DOI: 10.5753/jidm.2022.2537

Robson N. Fidalgo, Edson A. Silva

This article introduces EERCASE, a Computer Aided Software Engineering tool based on the best practices of the Model Driven Development paradigm, which provides a consistent environment for relational database design. EERCASE follows the graphical notation of the Enhanced Entity–Relationship model according to Elmasri and Navathe, implements the EERMM metamodel to avoid syntactically invalid constructs, shows and describes static semantic errors, and generates data definition code that takes into account advanced structural validations. The theoretical and technical framework used for the implementation of EERCASE is discussed, with emphasis on the restrictive and informative validations it performs. In addition, because it gives feedback on modeling errors and code generation, EERCASE is also presented as a computational environment that favors active learning.


Citations: 0

Searching for Researchers: an Ontology-based NoSQL Database System Approach and Practical Implementation

Journal of Information and Data Management

Pub Date : 2022-12-19 DOI: 10.5753/jidm.2022.2601

Mariana D. A. Salgueiro, Verônica dos Santos, André L. C. Rêgo, Daniel S. Guimarães, Jefferson B. Santos, Edward H. Haeusler, Marcos V. Villas, Sérgio Lifschitz

This work presents the design and implementation of two web-based search systems, Busc@NIMA and Quem@PUC. Both systems allow the identification of research and development projects, as well as existing competencies in laboratories and departments involving professors and researchers at PUC-Rio University. Our applications are based on a list of search-related terms that are matched against a dataset composed of PUC-Rio’s Lattes CVs, offered courses, information from administrative systems, and specific keywords that are input by the professors/researchers themselves. To integrate all the needed data, we consider multiple database and search technologies, such as XML, RDF, TripleStores, and Relational Databases. Search results include the names, academic papers, teaching activities, contact links, keywords, and laboratories of the professors involved with the subject represented by the set of keywords input. We describe the main features that show how our systems work.


Citations: 0

Scientific Collaboration Network Views: A Brazilian Computer Science Graduate Programs Case

Journal of Information and Data Management

Pub Date : 2022-10-03 DOI: 10.5753/jidm.2022.2695

Aurelio Ribeiro Costa, Vanessa Tavares Nunes, Célia Ghedini Ralha

Scientific collaboration networks can present different views of researchers’ interactions. This work presents SCI-synergy, an online navigable artifact aiming to promote mechanisms and views of scientific collaboration networks. The artifact focuses on the researchers’ interaction in the co-authorship of publications, considering intra- and inter-program relationships. SCI-synergy is developed under the design science research paradigm using scientific publication data available in the large Digital Bibliography & Library Project (DBLP) repository. Official data from the Sucupira repository on the members of six Brazilian graduate programs are used: Federal University of Minas Gerais (UFMG), State University of São Paulo (USP), Federal University of Rio Grande do Norte (UFRN), Federal University of Amazonas (UFAM), University of Brasília (UnB), and University of Vale do Rio dos Sinos (UNISINOS). Data from these graduate programs illustrate the artifact usage regarding the scientific collaboration network of each program, how each researcher cooperates, and what relationship patterns exist in intra- and inter-program views. We argue that, even though it is necessary to consider data from each program’s history and current contextualization regarding politics, economics, and administration, the collaboration network views provided by SCI-synergy might help to understand collaboration network patterns.


Citations: 0

SentiLexBR: An Automatic Methodology of Building Sentiment Lexicons for the Portuguese Language

Journal of Information and Data Management

Pub Date : 2022-09-21 DOI: 10.5753/jidm.2022.2504

Tiago de Melo

User reviews are readily available on the Web and widely used for sentiment analysis tasks. Sentiment lexicons play an important role in sentiment analysis, where each sentiment word is given a sentiment label (positive or negative) or score (1 or -1). However, a sentiment lexicon may express different sentiment polarities depending on the domain. In addition, only a few studies on Portuguese sentiment analysis are reported, due to the lack of resources such as domain-specific sentiment lexical corpora. In this paper, we present an effective methodology, called SentiLexBR, that uses probabilities from Bayes’ theorem to build a set of sentiment lexicons. An unsupervised algorithm is proposed to automatically identify sentiment lexicons with their polarities for the Portuguese language. Experimental results on user review datasets in 12 different domains indicate the effectiveness of our methodology in domain-specific sentiment lexicon generation for Portuguese. In addition, the sentiment lexicon produced by SentiLexBR significantly outperforms several alternative approaches to building domain-specific sentiment lexicons.
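The abstract does not spell out how the Bayes probabilities are turned into polarities, so the following is only an illustrative sketch of the general idea, not the SentiLexBR algorithm itself. It assumes each review already carries a coarse "pos"/"neg" signal (for instance derived from a star rating, which is an assumption made here), estimates P(pos | word) with Bayes' theorem from per-class document frequencies, and keeps only words whose posterior is far from 0.5. The function name `build_lexicon`, the `threshold` and the smoothing constant `alpha` are all hypothetical.

```python
from collections import Counter


def build_lexicon(reviews, threshold=0.7, alpha=1.0):
    """Return {word: +1 or -1} from (text, label) pairs, label in {"pos", "neg"}.

    Illustrative only: P(pos | word) is computed with Bayes' theorem from
    smoothed per-class document frequencies; words with an ambiguous
    posterior are left out of the lexicon.
    """
    classes = ("pos", "neg")
    doc_count = Counter()                       # number of reviews per class
    word_docs = {c: Counter() for c in classes}  # reviews per class containing a word
    for text, label in reviews:
        doc_count[label] += 1
        word_docs[label].update(set(text.lower().split()))

    total = sum(doc_count[c] for c in classes)
    if total == 0:
        return {}
    prior = {c: doc_count[c] / total for c in classes}

    lexicon = {}
    vocabulary = set(word_docs["pos"]) | set(word_docs["neg"])
    for word in vocabulary:
        # Smoothed likelihood P(word | class) (Laplace smoothing with alpha).
        likelihood = {
            c: (word_docs[c][word] + alpha) / (doc_count[c] + 2 * alpha)
            for c in classes
        }
        evidence = sum(likelihood[c] * prior[c] for c in classes)
        p_pos = likelihood["pos"] * prior["pos"] / evidence  # Bayes' theorem
        if p_pos >= threshold:
            lexicon[word] = 1
        elif p_pos <= 1 - threshold:
            lexicon[word] = -1
    return lexicon
```

Running the function separately on the reviews of each domain would yield one lexicon per domain, which is one simple way to let the same word receive different polarities in different domains.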


Citations: 0

Evaluation of Automatic Speech Recognition Approaches

Journal of Information and Data Management

Pub Date : 2022-09-21 DOI: 10.5753/jidm.2022.2514

Regis Pires Magalhães, Daniel Jean Rodrigues Vasconcelos, Guilherme Sales Fernandes, Lívia Almada Cruz, Matheus Xavier Sampaio, José Antônio Fernandes de Macêdo, Ticiana Linhares Coelho da Silva

Automatic Speech Recognition (ASR) is essential for many applications, such as automatic caption generation for videos, voice search, voice commands for smart homes, and chatbots. Due to the increasing popularity of these applications and the advances in deep learning models for transcribing speech into text, this work aims to evaluate the performance of commercial solutions for ASR that use deep learning models, such as Facebook Wit.ai, Microsoft Azure Speech, Google Cloud Speech-to-Text, Wav2Vec, and AWS Transcribe. We performed the experiments with two real and public datasets, Mozilla Common Voice and Voxforge. The results demonstrate that the evaluated solutions differ only slightly. However, Facebook Wit.ai outperforms the other analyzed approaches on the collected quality metrics, such as WER, BLEU, and METEOR. We also fine-tune the Jasper neural network for ASR with four different datasets that have no intersection with the ones on which we collect the quality metrics. We study the performance of the Jasper model on the two public datasets, comparing its results with those of the other pre-trained models.
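As a reminder of what the WER metric named above measures, here is a minimal sketch of word error rate: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The abstract does not describe the text normalization used in the study, so plain whitespace tokenization is an assumption here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference; lower is better.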


Citations: 3

FASED: A Framework for Data Ecosystems Health Evaluation

Journal of Information and Data Management

Pub Date : 2022-09-21 DOI: 10.5753/jidm.2022.2511

Glória de Fátima B. Lima, Marcelo Iury S. Oliveira, Bernadette Farias Lóscio

The growing availability of data in digital media has contributed to the creation of a large number of data ecosystems. However, building a successful Data Ecosystem is still a challenge. To prevent the failure of a Data Ecosystem and ensure its survival, evaluating its health becomes fundamental. In general, the health of a Data Ecosystem can be defined as its ability to grow and survive over time. Indicators such as productivity, robustness, niche creation and sustainability can be employed to evaluate the health of a Data Ecosystem. In this paper, we propose a framework for Data Ecosystem health evaluation composed of a set of indicators and metrics, which assess the Data Ecosystem’s current state and its ability to stay healthy over time. The results obtained when using the proposed framework offer evidence to assist in decision making on how data has been published and consumed in a Data Ecosystem, as well as to evaluate which ecosystems are more prosperous or need more investment.


Citations: 1

FeatSet+: Visual Features Extracted from Public Image Datasets

Journal of Information and Data Management

Pub Date : 2022-08-15 DOI: 10.5753/jidm.2022.2328

Mirela T. Cazzolato, Lucas C. Scabora, Guilherme F. Zabot, Marco A. Gutierrez, Caetano Traina Jr., Agma J. M. Traina

Real-world applications generate large amounts of images every day. With the generalized use of social media, users frequently share images acquired by smartphones. Also, hospitals, clinics, exhibits, factories, and other facilities generate images with potential use for many applications. Processing the generated images usually requires feature extraction, which can be time-consuming and laborious. In this paper, we present FeatSet+, a compilation of color, texture and shape visual features extracted from 17 open image datasets reported in the literature. FeatSet+ provides a collection of 11 distinct visual features, extracted by well-known Feature Extraction Methods (FEMs) such as LBP, Haralick, and Color Layout. We organized the available features in a standard collection, including the metadata and labels, when available. Eleven of the datasets also contain classes, which aid the evaluation of supervised methods such as classifiers and clustering tasks. FeatSet+ is available for download in a public repository as SQL scripts and CSV files. Additionally, FeatSet+ provides a description of the domain of each dataset, including the reference and link to the original work. We show the potential applicability of FeatSet+ in four computational tasks: multi-attribute analysis and retrieval, visual analysis using Multidimensional Scaling (MDS) and Principal Components Analysis (PCA), global feature classification, and dimensionality reduction. FeatSet+ can be employed to evaluate supervised and non-supervised learning tasks, and it also widely supports Content-Based Image Retrieval (CBIR) applications and complex data indexing using Metric Access Methods (MAMs).


Citations: 0

Collecting, extracting and storing web research survey questionnaires data

Journal of Information and Data Management

Pub Date : 2022-08-15 DOI: 10.5753/jidm.2022.2318

Carina F. Dorneles, Gilney N. Mathias

Companies or institutions can use survey questionnaires to evaluate items or products, analyze the satisfaction of their employees or customers, or collect any data they consider helpful. Furthermore, questionnaires can be used to collect data for research studies. Some problems in creating such questionnaires involve deciding which questions to ask, how to ask them, and how to organize them. Many research communities, especially in the healthcare field, maintain repositories that are publicly accessible and include different questionnaires that help professionals and researchers analyze the results of questions, add new questions, or even point out nonsense questions. In this paper, we describe: (i) a web crawler, which scans the Web searching for sites that possibly contain questionnaires; (ii) an extractor, which extracts the questionnaires from the list of pages collected by the crawler and saves them into a relational database; and (iii) the public dataset we have created to persist the questionnaires. The database created can then serve to analyze these data and/or as a centralized base of examples to prepare new questionnaires or reuse existing questions. The experiments we have conducted demonstrate that our crawler has achieved 94.47%, and the extractor has achieved a precision between 90% and 92%.


Citations: 0

Musical Success in the United States and Brazil: Novel Datasets and Temporal Analyses

Journal of Information and Data Management

Pub Date : 2022-08-15 DOI: 10.5753/jidm.2022.2350

Gabriel P. Oliveira, Gabriel R. G. Barbosa, Bruna C. Melo, Juliana E. Botelho, Mariana O. Silva, Danilo B. Seufitelli, Mirella M. Moro

Music is not only a worldwide essential cultural industry but also one of the most dynamic. The increasing volume of complex music-related data defines new challenges and opportunities for extracting knowledge, benefiting not only different music segments but also the Music Information Retrieval research field. In this article, we assess musical success in the United States and Brazil, two of the biggest music markets in the world. We first introduce MUHSIC and MUHSIC-BR, two novel datasets with enhanced success information that combine chart-related data with acoustic metadata to describe the temporal evolution of musical careers. Then, we use such enriched and curated data to cluster artists according to their success level by considering their high-impact periods (hot streaks). Our results reveal three groups with distinct success behavior over time. Furthermore, Brazil and the US present specific music success patterns regarding artists and genres, reflecting the importance of analyzing regional markets individually.


Citations: 3
