
Data: Latest Publications

Medical Opinions Analysis about the Decrease of Autopsies Using Emerging Pattern Mining
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-21 DOI: 10.3390/data9010002
Isaac Machorro-Cano, Ingrid Aylin Ríos-Méndez, José Antonio Palet-Guzmán, Nidia Rodríguez-Mazahua, L. Rodríguez-Mazahua, G. Alor-Hernández, J. O. Olmedo-Aguirre
An autopsy is a widely recognized procedure for ensuring ongoing improvement in medicine, with extensive application in legal, scientific, medical, and research domains. However, declining autopsy rates in hospitals are a worldwide concern. For example, the Regional Hospital of Rio Blanco in Veracruz, Mexico, has seen a substantial decline in autopsies in recent years. Since there are no documented historical records of a decrease in the frequency of autopsy cases, it is crucial to establish a methodological framework to substantiate any actual trends in the data. Emerging pattern mining (EPM) finds differences between classes or data sets by building a descriptive data model with respect to some given remarkable property, and data set description has become a significant application area in various contexts in recent years. In this study, various EPM algorithms were used to extract emergent patterns from a data set collected from medical experts’ perspectives on the decline of hospital autopsies. Notably, the top-performing EPM algorithms were iEPMiner, LCMine, SJEP-C, Top-k minimal SJEPs, and Tree-based JEP-C. Among these, iEPMiner and LCMine ran faster and produced superior emergent patterns under metrics such as Confidence, Weighted Relative Accuracy Criteria (WRACC), False Positive Rate (FPR), and True Positive Rate (TPR).
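The ranking metrics named above have simple closed forms. As an illustration (a minimal sketch, not the authors' implementation; `pattern_metrics` and the toy attribute encoding are assumptions), Confidence, WRACC, TPR, and FPR for one candidate pattern over a labeled data set can be computed as:

```python
def pattern_metrics(rows, labels, pattern, target):
    """Quality metrics for one candidate emerging pattern.

    rows: list of dicts (attribute -> value); labels: class label per row;
    pattern: dict of attribute -> required value; target: the positive class.
    """
    n = len(rows)
    covered = [all(r.get(a) == v for a, v in pattern.items()) for r in rows]
    positive = [lab == target for lab in labels]
    tp = sum(c and p for c, p in zip(covered, positive))
    fp = sum(c and not p for c, p in zip(covered, positive))
    n_pos, n_cov = sum(positive), sum(covered)
    confidence = tp / n_cov if n_cov else 0.0        # P(target | pattern)
    tpr = tp / n_pos if n_pos else 0.0               # support in the target class
    fpr = fp / (n - n_pos) if n - n_pos else 0.0     # support outside it
    wracc = (n_cov / n) * (confidence - n_pos / n)   # coverage * (precision - prior)
    return {"confidence": confidence, "TPR": tpr, "FPR": fpr, "WRACC": wracc}
```

A pattern is "emerging" when its support differs sharply between classes, i.e. a high TPR combined with a low FPR.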
Citations: 0
Unlocking Insights: Analysing COVID-19 Lockdown Policies and Mobility Data in Victoria, Australia, through a Data-Driven Machine Learning Approach
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-21 DOI: 10.3390/data9010003
Shiyang Lyu, O. Adegboye, Kiki Adhinugraha, T. Emeto, David Taniar
The state of Victoria, Australia, implemented one of the world’s most prolonged cumulative lockdowns in 2020 and 2021. Although lockdowns have proven effective in managing COVID-19 worldwide, this approach faced challenges in containing rising infections in Victoria. This study evaluates the effects of short-term (less than 60 days) and long-term (more than 60 days) lockdowns on public mobility and the effectiveness of various social restriction measures within these periods. The aim is to understand the complexities of pandemic management by examining various measures over different lockdown durations, thereby contributing to more effective COVID-19 containment methods. Using restriction policy, community mobility, and COVID-19 data, a machine-learning-based simulation model was proposed, incorporating analysis of correlation, infection doubling time, and effective lockdown date. The model results highlight the significant impact of public event cancellations in preventing COVID-19 infection during short- and long-term lockdowns and the importance of international travel controls in long-term lockdowns. The effectiveness of social restriction was found to decrease significantly with the transition from short to long lockdowns, characterised by increased visits to public places and increased use of public transport, which may be associated with an increase in the effective reproduction number (Rt) and in infected cases.
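Of the quantities fed into the simulation model, infection doubling time is the most self-contained: it can be estimated from a cumulative case series with a log-linear least-squares fit. A sketch under that assumption (not the paper's actual code):

```python
import math

def doubling_time(cases):
    """Days for cases to double, estimated from a daily cumulative series.

    Fits log(cases) against time by least squares; doubling time is
    ln(2) divided by the fitted exponential growth rate.
    """
    n = len(cases)
    xs, ys = range(n), [math.log(c) for c in cases]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.log(2) / slope
```

A series that doubles every day, e.g. `[1, 2, 4, 8, 16]`, yields a doubling time of 1.0 days.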
Citations: 0
Expert-Annotated Dataset to Study Cyberbullying in Polish Language
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-20 DOI: 10.3390/data9010001
Michal Ptaszynski, Agata Pieciukiewicz, Pawel Dybala, Paweł Skrzek, Kamil Soliwoda, Marcin Fortuna, Gniewosz Leliwa, Michal Wroczynski
We introduce the first dataset of harmful and offensive language collected from the Polish Internet. The dataset was meticulously curated to facilitate the study of harmful online phenomena such as cyberbullying and hate speech, which have surged significantly both within the Polish Internet and globally. It was systematically collected and then annotated in two stages. First, it was annotated by two proficient layperson volunteers under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance precision, a second round of annotation was carried out by a team of annotators with specialized long-term expertise in cyberbullying and hate speech annotation. This second phase was overseen by an experienced annotator acting as a super-annotator. In its initial application, the dataset was used to categorize cyberbullying instances in the Polish language. Specifically, it serves as the foundation for two distinct tasks: (1) a binary classification that separates harmful from non-harmful messages and (2) a multi-class classification that distinguishes between two variants of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfactory classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.
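The two tasks differ only in how the labels are grouped: the binary task merges the two harmful classes. A minimal sketch of that mapping (the label strings are illustrative, not the dataset's actual encoding):

```python
# The two harmful classes of the multi-class task (illustrative names).
HARMFUL = {"cyberbullying", "hate_speech"}

def to_binary(label):
    """Collapse a three-way annotation into the binary task's label."""
    return "harmful" if label in HARMFUL else "non-harmful"

multi = ["non-harmful", "cyberbullying", "hate_speech", "non-harmful"]
binary = [to_binary(lab) for lab in multi]
# binary == ["non-harmful", "harmful", "harmful", "non-harmful"]
```

Training the binary classifier on labels derived this way keeps the two tasks consistent on the same underlying messages.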
Citations: 0
Genome Sequence of the Plant-Growth-Promoting Endophyte Curtobacterium flaccumfaciens Strain W004
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-09 DOI: 10.3390/data8120187
V. Chebotar, M. Gancheva, E. Chizhevskaya, M. E. Baganova, Oksana V. Keleinikova, Kharon A. Husainov, Veronika N. Pishchik
We report the whole-genome sequence of the endophyte Curtobacterium flaccumfaciens strain W004, isolated from seeds of the winter wheat cultivar Bezostaya 100. The genome was obtained using Oxford Nanopore MinION sequencing. The bacterium has a circular chromosome of 3.63 Mbp with a G+C content of 70.89%. We found that Curtobacterium flaccumfaciens strain W004 promoted the growth of spring wheat plants, increasing grain yield by 54.3%. Sequencing the genome of this new strain can provide insights into its potential role in plant–microbe interactions.
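The reported G+C content is simply the fraction of G and C bases in the assembled sequence. A toy sketch of that calculation (the real value is of course computed over the full assembly, not a short string):

```python
def gc_content(seq):
    """Percent G+C in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

For example, `gc_content("ATGC")` returns 50.0.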
Citations: 0
A Qualitative Dataset for Coffee Bio-Aggressors Detection Based on the Ancestral Knowledge of the Cauca Coffee Farmers in Colombia
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-08 DOI: 10.3390/data8120186
Juan Felipe Valencia-Mosquera, David Griol, Mayra Solarte-Montoya, Cristhian Figueroa, Juan Carlos Corrales, David Camilo Corrales
This paper describes a novel qualitative dataset on coffee pests based on the ancestral knowledge of coffee farmers in the Department of Cauca, Colombia. The dataset was obtained from a survey of coffee growers, with 432 records and 41 variables collected weekly from September 2020 to August 2021. It includes climatic conditions, productive activities, external conditions, and coffee bio-aggressors. This dataset allows researchers to find patterns for coffee crop protection through ancestral knowledge not captured by real-time agricultural sensors. To the best of our knowledge, no existing dataset offers qualitative data with similar characteristics, expressing the empirical knowledge coffee farmers use to detect triggers of causal behaviors of pests and diseases in coffee crops.
Citations: 0
Land Cover Classification in the Antioquia Region of the Tropical Andes Using NICFI Satellite Data Program Imagery and Semantic Segmentation Techniques
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-04 DOI: 10.3390/data8120185
Luisa F. Gomez-Ossa, G. Sanchez-Torres, John W. Branch-Bedoya
Land cover classification, generated from satellite imagery through semantic segmentation, has become fundamental for monitoring land use and land cover change (LULCC). The tropical Andes territory provides opportunities due to its significance in the provision of ecosystem services. However, the lack of reliable data for this region, coupled with challenges arising from its mountainous topography and diverse ecosystems, hinders the description of its coverage. Therefore, this research proposes the Tropical Andes Land Cover Dataset (TALANDCOVER). It is constructed from three sampling strategies that address imbalanced geographic data: random (aleatory), and a minimum of 50% or 70% representation per class. Additionally, the U-Net deep learning model is applied for enhanced and tailored classification of land covers. Using high-resolution data from the NICFI program, our analysis focuses on the Department of Antioquia in Colombia. The TALANDCOVER dataset, presented in TIF format, comprises multiband R-G-B-NIR images paired with six labels (dense forest, grasslands, heterogeneous agricultural areas, bodies of water, built-up areas, and bare/degraded lands), reaching an estimated 0.76 F1 score against expert-labeled ground truth and surpassing the precision of existing global cover maps for the study area. To the best of our knowledge, this work is the first to release open-source data for segmenting coverage with pixel-wise labeled NICFI imagery at a 4.77 m resolution. Experiments applying the sampling strategies and models yield F1 scores of 0.70, 0.72, and 0.74 for the random, balanced 50%, and balanced 70% strategies, respectively, over the expert-segmented sample (ground truth). This suggests that the tailored application of our deep learning model, together with the TALANDCOVER dataset, can facilitate training deep architectures for classifying large-scale covers in complex areas such as the tropical Andes. This advance has significant potential for decision making, emphasizing sustainable land use and the conservation of natural resources.
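The F1 comparisons above can be reproduced from pixel-wise predictions. A minimal sketch (not the study's evaluation code) of per-class and macro F1 over flattened label maps:

```python
def f1_per_class(y_true, y_pred, classes):
    """Pixel-wise F1 for each class, plus the macro average.

    y_true, y_pred: flat sequences of class labels, one per pixel.
    """
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    scores["macro"] = sum(scores[c] for c in classes) / len(classes)
    return scores
```

Macro averaging weights all six land cover classes equally, which matters for imbalanced maps dominated by a few classes.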
Citations: 0
An Urban Image Stimulus Set Generated from Social Media
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-01 DOI: 10.3390/data8120184
Ardaman Kaur, André Leite Rodrigues, Sarah Hoogstraten, D. A. Blanco-Mora, B. Miranda, Paulo Morgado, Dar Meshi
Social media data, such as photos and status posts, can be tagged with location information (geotagging). This geotagged information can be used for urban spatial analysis to explore neighborhood characteristics or mobility patterns. With increasing rural-to-urban migration, there is a need for comprehensive data capturing the complexity of urban settings and their influence on human experiences. Here, we share an urban image stimulus set from the city of Lisbon that researchers can use in their experiments. The stimulus set consists of 160 geotagged urban space photographs extracted from the Flickr social media platform. We divided the city into 100 × 100 m cells to calculate the cell image density (number of images in each cell) and the cell green index (Normalized Difference Vegetation Index of each cell) and assigned these values to each geotagged image. We also computed the popularity of each image (normalized views on the social network) and categorized the images into two putative groups by photographer status (residents and tourists), with 80 images in each group. With the rise of data-driven decisions in urban planning, this stimulus set can help explore human–urban environment interaction patterns, especially if complemented with survey/neuroimaging measures or machine-learning analyses.
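Assigning images to 100 × 100 m cells and counting them reduces to integer division on projected coordinates. A sketch assuming coordinates already projected to metres (`image_density` and `cell_index` are illustrative names, not the authors' code):

```python
from collections import Counter

CELL_SIZE = 100.0  # cell edge length in metres

def cell_index(x, y, cell=CELL_SIZE):
    """Map a projected (x, y) position in metres to its grid cell (col, row)."""
    return (int(x // cell), int(y // cell))

def image_density(points):
    """Number of geotagged images falling in each 100 x 100 m cell."""
    return Counter(cell_index(x, y) for x, y in points)
```

For example, `image_density([(10, 10), (90, 20), (150, 10)])` counts two images in cell (0, 0) and one in cell (1, 0); each image then inherits the density of its cell.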
Citations: 0
Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-01 DOI: 10.3390/data8120183
Yeongmin Son, Won Jun Kwak, Jae Wan Park
This study focuses on the field of voice forgery detection, which is growing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. It introduces a unique dataset built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provides the metadata of the original voice files, offering additional context and information that can be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhanced security in society.
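The audio-to-spectrogram conversion is a standard short-time Fourier transform. A minimal NumPy sketch with illustrative parameters (not the authors' exact conversion pipeline):

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram of a mono signal via a Hann-windowed STFT.

    Returns an array of shape (n_frames, n_fft // 2 + 1), one row per frame.
    """
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(signal[start:start + n_fft] * window))
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    return np.array(frames)
```

For a pure tone at a quarter of the sample rate, the energy concentrates in bin n_fft / 4 of each frame; rendering such arrays as images yields the kind of spectrograms the dataset distributes.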
Citations: 0
An Automated Big Data Quality Anomaly Correction Framework Using Predictive Analysis
IF 2.6 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-01 DOI: 10.3390/data8120182
Widad Elouataoui, Saida El Mendili, Youssef Gahi
Big data has emerged as a fundamental component in various domains, enabling organizations to extract valuable insights and make informed decisions. Ensuring data quality, however, is crucial for using big data effectively, and big data quality has therefore been gaining attention from researchers and practitioners in recent years due to its significant impact on decision-making processes. Existing studies addressing data quality anomalies often have a limited scope, concentrating on specific aspects such as outliers or inconsistencies, and many approaches are context-specific, lacking a generic solution applicable across different domains. To the best of our knowledge, no existing framework automatically addresses quality anomalies comprehensively and generically, considering all aspects of data quality. To fill this gap, we propose a framework that automatically corrects big data quality anomalies using an intelligent predictive model. The proposed framework addresses the main aspects of data quality by considering six key quality dimensions: Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability. Moreover, the framework is not tied to a specific field and is designed to be applicable across various areas, offering a generic approach to data quality anomalies. Implemented on two datasets, the framework achieved an accuracy of 98.22%, and the results show that it boosted the overall data quality score to 99%, an improvement of up to 14.76%.
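The paper's scoring formulas are not reproduced in the abstract, but two of the six dimensions it names lend themselves to a simple illustration. This is a hedged sketch, not the authors' implementation — `completeness`, `uniqueness`, and the toy records are all invented here:

```python
def completeness(rows, fields):
    """Fraction of non-missing cells: the Completeness dimension."""
    total = len(rows) * len(fields)
    missing = sum(1 for r in rows for f in fields if r.get(f) is None)
    return 1 - missing / total

def uniqueness(rows, fields):
    """Fraction of distinct records: the Uniqueness dimension."""
    seen = {tuple(r.get(f) for f in fields) for r in rows}
    return len(seen) / len(rows)

# Toy dataset with one missing-value pair and one exact duplicate row.
records = [
    {"id": 1, "city": "Rabat"},
    {"id": 2, "city": None},
    {"id": 2, "city": None},   # exact duplicate of the row above
    {"id": 4, "city": "Fes"},
]
fields = ["id", "city"]
print(completeness(records, fields), uniqueness(records, fields))  # 0.75 0.75
```

A full framework in the paper's spirit would score all six dimensions and then feed anomalous records to a predictive model for correction rather than just measuring them.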
Citations: 0
Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German
Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-11-10 DOI: 10.3390/data8110170
Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
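Working with a subset of folds, as in the abstract's case study, amounts to summing per-fold counts and reading hapax legomena (items occurring exactly once) off the aggregate. This is a toy sketch with invented data; `aggregate_folds` is not part of the DeReKoGram tooling or its accompanying Python/R/Stata scripts:

```python
from collections import Counter

def aggregate_folds(folds):
    """Sum (lemma, POS) n-gram counts over a chosen subset of corpus folds.

    Using fewer folds trades vocabulary coverage for compute; hapax
    legomena are the items seen exactly once in the aggregated counts.
    """
    total = Counter()
    for fold in folds:
        total.update(fold)
    hapax = [gram for gram, n in total.items() if n == 1]
    return total, hapax

# Two toy folds of (lemma, POS) unigram counts -- not real DeReKoGram data.
fold_a = {("Haus", "NN"): 3, ("laufen", "VVFIN"): 1}
fold_b = {("Haus", "NN"): 2, ("schnell", "ADJD"): 1}
total, hapax = aggregate_folds([fold_a, fold_b])
print(total[("Haus", "NN")], sorted(hapax))
```

Adding folds one at a time to such an aggregate reproduces the vocabulary-growth analysis described above on a small scale.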
Citations: 0