Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1379921
Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit
Social media has profoundly changed our modes of self-expression, communication, and participation in public discourse, generating volumes of conversations and content that cover every aspect of our social lives. Social media platforms have thus become increasingly important as data sources to identify social trends and phenomena. In recent years, academics have steadily lost ground on access to social media data as technology companies have set more restrictions on Application Programming Interfaces (APIs) or entirely closed public APIs. This circumstance halts the work of many social scientists who have used such data to study issues of public good. We considered the viability of eight approaches for image-based social media data collection: data philanthropy organizations, data repositories, data donation, third-party data companies, homegrown tools, and various web scraping tools and scripts. This paper discusses the advantages and challenges of these approaches from literature and from the authors' experience. We conclude the paper by discussing mechanisms for improving social media data collection that will enable this future frontier of social science research.
{"title":"From theory to practice: insights and hurdles in collecting social media data for social science research.","authors":"Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit","doi":"10.3389/fdata.2024.1379921","DOIUrl":"10.3389/fdata.2024.1379921","url":null,"abstract":"<p><p>Social media has profoundly changed our modes of self-expression, communication, and participation in public discourse, generating volumes of conversations and content that cover every aspect of our social lives. Social media platforms have thus become increasingly important as data sources to identify social trends and phenomena. In recent years, academics have steadily lost ground on access to social media data as technology companies have set more restrictions on Application Programming Interfaces (APIs) or entirely closed public APIs. This circumstance halts the work of many social scientists who have used such data to study issues of public good. We considered the viability of eight approaches for image-based social media data collection: data philanthropy organizations, data repositories, data donation, third-party data companies, homegrown tools, and various web scraping tools and scripts. This paper discusses the advantages and challenges of these approaches from literature and from the authors' experience. We conclude the paper by discussing mechanisms for improving social media data collection that will enable this future frontier of social science research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1379921"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169574/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141319551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1412837
Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe
Introduction: Air quality is directly affected by pollutant emissions from vehicles, especially in large cities and metropolitan areas or where compliance with vehicle emission standards is not enforced. Particulate Matter (PM) is one of the pollutants emitted by fuel burning in internal combustion engines; it remains suspended in the atmosphere, causing respiratory and cardiovascular health problems in the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5.
Methods: Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline.
Results: RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables.
Discussion: The RF and LSTM models improved prediction and forecasting relative to the MLR and naive baselines, respectively. The LSTM was trained on data from the COVID-19 pandemic period (2020 and 2021) and was able to forecast PM2.5 concentrations in 2022, a year in which the data show greater vehicle circulation and higher PM2.5 peaks. Our results support the physical understanding of the factors influencing pollutant dispersion from vehicle emissions in the lower atmosphere of urban environments. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.
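The prediction setup described in the Methods can be approximated in a few lines. A minimal sketch, assuming a hypothetical file and column names for the hourly averages (the study's actual data and pipeline are not public):

```python
# Hedged sketch of the RF prediction step: fit on meteorological and
# traffic features, then rank predictors by permutation importance, as
# the Results report (PBL height first). File/column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("curitiba_hourly.csv")  # assumed hourly-averaged table
features = ["pbl_height", "wind_gust", "wind_velocity", "temperature",
            "humidity", "vehicle_flow"]  # assumed predictor names
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["pm25"], test_size=0.2, shuffle=False)  # keep time order

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Rank predictors on held-out data (predictions only, no refitting)
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```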
{"title":"Particulate matter forecast and prediction in Curitiba using machine learning.","authors":"Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe","doi":"10.3389/fdata.2024.1412837","DOIUrl":"10.3389/fdata.2024.1412837","url":null,"abstract":"<p><strong>Introduction: </strong>Air quality is directly affected by pollutant emission from vehicles, especially in large cities and metropolitan areas or when there is no compliance check for vehicle emission standards. Particulate Matter (PM) is one of the pollutants emitted from fuel burning in internal combustion engines and remains suspended in the atmosphere, causing respiratory and cardiovascular health problems to the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5.</p><p><strong>Methods: </strong>Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline.</p><p><strong>Results: </strong>RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables.</p><p><strong>Discussion: </strong>The RF and LSTM models were able to improve prediction and forecasting compared with MLR and Naive, respectively. The LSTM was trained with data corresponding to the period of the COVID-19 pandemic (2020 and 2021) and was able to forecast the concentration of PM2.5 in 2022, in which the data show that there was greater circulation of vehicles and higher peaks in the concentration of PM2.5. Our results can help the physical understanding of factors influencing pollutant dispersion from vehicle emissions at the lower atmosphere in urban environment. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1412837"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169811/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1330392
Jinghui Chen, Takayuki Mizuno, Shohei Doi
Traditional monolingual word embedding models transform words into high-dimensional vectors that represent semantic relations between words as relationships between vectors in the high-dimensional space. They serve as productive tools for interpreting multifarious aspects of the social world in social science research. Building on previous research that interprets multifaceted meanings of words by projecting them onto word-level dimensions defined by differences between antonyms, we extend the architecture of word-level cultural dimensions to the sentence level and adopt a Language-agnostic BERT Sentence Embedding (LaBSE) model to detect position similarities in a multi-language environment. We assess the efficacy of our sentence-level methodology using Twitter data from US politicians, comparing it to the traditional word-level embedding model. We also adopt Latent Dirichlet Allocation (LDA) to investigate detailed topics in these tweets and interpret politicians' positions from different angles. In addition, we adopt Twitter data from Spanish politicians and visualize their positions in a multi-language space to analyze position similarities across countries. The results show that our sentence-level methodology outperforms the traditional word-level model. We also demonstrate that our methodology is effective in dealing with finely sorted themes, as political positions toward different topics vary even within the same politician. Through verification using American and Spanish political datasets, we find that the positioning of American and Spanish politicians on our defined liberal-conservative axis aligns with social common sense, political news, and previous research. Our architecture improves on the standard word-level methodology and can serve as a useful architecture for sentence-level applications in the future.
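The core projection idea is easy to sketch. Below is a hedged illustration using the public sentence-transformers LaBSE checkpoint; the two pole statements defining the liberal-conservative axis and the sample tweets are invented placeholders, not the authors' choices:

```python
# Sketch: embed sentences with LaBSE and project them onto an axis defined
# by the difference between two "pole" statements. Poles are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

liberal_pole = "We must expand public healthcare and protect voting rights."
conservative_pole = "We must cut taxes and defend traditional values."
axis = model.encode(conservative_pole) - model.encode(liberal_pole)
axis /= np.linalg.norm(axis)  # unit-length axis in the embedding space

tweets = ["Healthcare is a human right.",
          "La bajada de impuestos beneficia a las familias."]  # multi-language
positions = model.encode(tweets) @ axis  # signed position along the axis
print(dict(zip(tweets, positions.round(3))))
```

Because LaBSE maps all languages into one space, English and Spanish texts land on the same axis, which is what makes the cross-country comparison possible.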
{"title":"Analyzing political party positions through multi-language twitter text embeddings.","authors":"Jinghui Chen, Takayuki Mizuno, Shohei Doi","doi":"10.3389/fdata.2024.1330392","DOIUrl":"10.3389/fdata.2024.1330392","url":null,"abstract":"<p><p>Traditional monolingual word embedding models transform words into high-dimensional vectors which represent semantics relations between words as relationships between vectors in the high-dimensional space. They serve as productive tools to interpret multifarious aspects of the social world in social science research. Building on the previous research which interprets multifaceted meanings of words by projecting them onto word-level dimensions defined by differences between antonyms, we extend the architecture of establishing word-level cultural dimensions to the sentence level and adopt a Language-agnostic BERT model (LaBSE) to detect position similarities in a multi-language environment. We assess the efficacy of our sentence-level methodology using Twitter data from US politicians, comparing it to the traditional word-level embedding model. We also adopt Latent Dirichlet Allocation (LDA) to investigate detailed topics in these tweets and interpret politicians' positions from different angles. In addition, we adopt Twitter data from Spanish politicians and visualize their positions in a multi-language space to analyze position similarities across countries. The results show that our sentence-level methodology outperform traditional word-level model. We also demonstrate that our methodology is effective dealing with fine-sorted themes from the result that political positions towards different topics vary even within the same politicians. Through verification using American and Spanish political datasets, we find that the positioning of American and Spanish politicians on our defined liberal-conservative axis aligns with social common sense, political news, and previous research. Our architecture improves the standard word-level methodology and can be considered as a useful architecture for sentence-level applications in the future.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1330392"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169868/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-02 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1366415
Muhammad Aasem, Muhammad Javed Iqbal
Chest X-ray (CXR) imaging is widely employed by radiologists to diagnose thoracic diseases. Recently, many deep learning techniques have been proposed as computer-aided diagnostic (CAD) tools to assist radiologists in minimizing the risk of incorrect diagnosis. From an application perspective, these models have exhibited two major challenges: (1) they require large volumes of annotated data at the training stage and (2) they lack explainable factors to justify their outcomes at the prediction stage. In the present study, we developed a class activation mapping (CAM)-based ensemble model, called Ensemble-CAM, to address both of these challenges via weakly supervised learning by employing explainable AI (XAI) functions. Ensemble-CAM utilizes class labels to predict the location of disease in association with interpretable features. The proposed work leverages ensemble and transfer learning with class activation functions to achieve three objectives: (1) minimizing the dependency on strongly annotated data when locating thoracic diseases, (2) enhancing confidence in predicted outcomes by visualizing their interpretable features, and (3) optimizing cumulative performance via fusion functions. Ensemble-CAM was trained on three CXR image datasets and evaluated through qualitative and quantitative measures via heatmaps and Jaccard indices. The results reflect enhanced performance and reliability in comparison to existing standalone and ensemble models.
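The quantitative evaluation mentioned above can be illustrated independently of the model itself. A minimal sketch of scoring a class activation heatmap against an annotated disease region with the Jaccard index; the 0.5 threshold and the random stand-in arrays are assumptions, not the paper's protocol:

```python
# Sketch: binarize a CAM heatmap relative to its peak activation and
# compute the Jaccard index (IoU) against a ground-truth mask.
import numpy as np

def jaccard_from_heatmap(heatmap: np.ndarray, gt_mask: np.ndarray,
                         thresh: float = 0.5) -> float:
    """IoU between the thresholded activation map and the annotated region."""
    pred = heatmap >= thresh * heatmap.max()  # binarize relative to peak
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return float(inter / union) if union else 1.0

cam = np.random.rand(224, 224)        # stand-in for an Ensemble-CAM heatmap
mask = np.zeros((224, 224), dtype=bool)
mask[80:140, 90:160] = True           # stand-in ground-truth bounding region
print(f"Jaccard index: {jaccard_from_heatmap(cam, mask):.3f}")
```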
{"title":"Toward explainable AI in radiology: Ensemble-CAM for effective thoracic disease localization in chest X-ray images using weak supervised learning.","authors":"Muhammad Aasem, Muhammad Javed Iqbal","doi":"10.3389/fdata.2024.1366415","DOIUrl":"https://doi.org/10.3389/fdata.2024.1366415","url":null,"abstract":"<p><p>Chest X-ray (CXR) imaging is widely employed by radiologists to diagnose thoracic diseases. Recently, many deep learning techniques have been proposed as computer-aided diagnostic (CAD) tools to assist radiologists in minimizing the risk of incorrect diagnosis. From an application perspective, these models have exhibited two major challenges: (1) They require large volumes of annotated data at the training stage and (2) They lack explainable factors to justify their outcomes at the prediction stage. In the present study, we developed a class activation mapping (CAM)-based ensemble model, called Ensemble-CAM, to address both of these challenges via weakly supervised learning by employing explainable AI (XAI) functions. Ensemble-CAM utilizes class labels to predict the location of disease in association with interpretable features. The proposed work leverages ensemble and transfer learning with class activation functions to achieve three objectives: (1) minimizing the dependency on strongly annotated data when locating thoracic diseases, (2) enhancing confidence in predicted outcomes by visualizing their interpretable features, and (3) optimizing cumulative performance via fusion functions. Ensemble-CAM was trained on three CXR image datasets and evaluated through qualitative and quantitative measures via heatmaps and Jaccard indices. The results reflect the enhanced performance and reliability in comparison to existing standalone and ensembled models.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1366415"},"PeriodicalIF":3.1,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11096460/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140960924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1396638
Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla
[This corrects the article DOI: 10.3389/fdata.2023.1086212.].
{"title":"Corrigendum: A community focused approach toward making healthy and affordable daily diet recommendations.","authors":"Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla","doi":"10.3389/fdata.2024.1396638","DOIUrl":"https://doi.org/10.3389/fdata.2024.1396638","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.3389/fdata.2023.1086212.].</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1396638"},"PeriodicalIF":3.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11024675/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140870346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-25 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1366312
Fiza Saeed Malik, Muhammad Haroon Yousaf, Hassan Ahmed Sial, Serestina Viriri
Background: Melanoma is one of the deadliest skin cancers; it originates from melanocytes in which sun exposure causes mutations. Early detection boosts the cure rate to 90%, but misclassification drops survival to 15-20%. Clinical variation challenges dermatologists in distinguishing benign nevi from melanomas. Current diagnostic methods, including visual analysis and dermoscopy, have limitations, emphasizing the need for Artificial Intelligence in dermatology.
Objectives: In this paper, we aim to explore dermoscopic structures for the classification of melanoma lesions. The training of AI models faces a challenge known as brittleness, where small changes in input images impact the classification. A study explored AI vulnerability in discerning melanoma from benign lesions using features of size, color, and shape. Tests with artificial and natural variations revealed a notable decline in accuracy, emphasizing the necessity for additional information, such as dermoscopic structures.
Methodology: The study utilizes datasets with clinically marked dermoscopic images examined by expert clinicians. Transformers and CNN-based models are employed to classify these images based on dermoscopic structures. Classification results are validated using feature visualization. To assess model susceptibility to image variations, classifiers are evaluated on test sets with original, duplicated, and digitally modified images. Additionally, testing is done on ISIC 2016 images. The study focuses on three dermoscopic structures crucial for melanoma detection: Blue-white veil, dots/globules, and streaks.
Results: In evaluating model performance, adding convolutions to Vision Transformers proves highly effective, achieving up to 98% accuracy. CNN architectures like VGG-16 and DenseNet-121 reach 50-60% accuracy, performing best with features other than dermoscopic structures. Vision Transformers without convolutions exhibit reduced accuracy on diverse test sets, revealing their brittleness. OpenAI CLIP, a pre-trained model, consistently performs well across the various test sets. To address brittleness, a mitigation method involving extensive data augmentation during training and 23 transformed duplicates at test time sustains accuracy.
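The test-time side of that mitigation can be sketched generically. Assuming a PyTorch image classifier; only the count of 23 duplicates comes from the abstract, the specific transforms are illustrative:

```python
# Sketch of test-time augmentation: average a classifier's probabilities
# over the original image plus transformed duplicates.
import torch
import torchvision.transforms as T

tta = T.Compose([T.RandomHorizontalFlip(), T.RandomRotation(15),
                 T.ColorJitter(brightness=0.2, contrast=0.2)])

@torch.no_grad()
def tta_predict(model, image: torch.Tensor, n_copies: int = 23):
    """Mean softmax over image (a CxHxW float tensor) and n duplicates."""
    batch = torch.stack([image] + [tta(image) for _ in range(n_copies)])
    return model(batch).softmax(dim=1).mean(dim=0)
```

Averaging over the duplicates trades extra inference cost for stability on perturbed inputs, which is the brittleness the paper targets.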
Conclusions: This paper proposes a melanoma classification scheme utilizing three dermoscopic structures across the Ph2 and Derm7pt datasets and addresses AI susceptibility to image variations. Given the small dataset, future work includes collecting more annotated datasets and automatically computing dermoscopic structural features.
{"title":"Exploring dermoscopic structures for melanoma lesions' classification.","authors":"Fiza Saeed Malik, Muhammad Haroon Yousaf, Hassan Ahmed Sial, Serestina Viriri","doi":"10.3389/fdata.2024.1366312","DOIUrl":"https://doi.org/10.3389/fdata.2024.1366312","url":null,"abstract":"<p><strong>Background: </strong>Melanoma is one of the deadliest skin cancers that originate from melanocytes due to sun exposure, causing mutations. Early detection boosts the cure rate to 90%, but misclassification drops survival to 15-20%. Clinical variations challenge dermatologists in distinguishing benign nevi and melanomas. Current diagnostic methods, including visual analysis and dermoscopy, have limitations, emphasizing the need for Artificial Intelligence understanding in dermatology.</p><p><strong>Objectives: </strong>In this paper, we aim to explore dermoscopic structures for the classification of melanoma lesions. The training of AI models faces a challenge known as brittleness, where small changes in input images impact the classification. A study explored AI vulnerability in discerning melanoma from benign lesions using features of size, color, and shape. Tests with artificial and natural variations revealed a notable decline in accuracy, emphasizing the necessity for additional information, such as dermoscopic structures.</p><p><strong>Methodology: </strong>The study utilizes datasets with clinically marked dermoscopic images examined by expert clinicians. Transformers and CNN-based models are employed to classify these images based on dermoscopic structures. Classification results are validated using feature visualization. To assess model susceptibility to image variations, classifiers are evaluated on test sets with original, duplicated, and digitally modified images. Additionally, testing is done on ISIC 2016 images. The study focuses on three dermoscopic structures crucial for melanoma detection: Blue-white veil, dots/globules, and streaks.</p><p><strong>Results: </strong>In evaluating model performance, adding convolutions to Vision Transformers proves highly effective for achieving up to 98% accuracy. CNN architectures like VGG-16 and DenseNet-121 reach 50-60% accuracy, performing best with features other than dermoscopic structures. Vision Transformers without convolutions exhibit reduced accuracy on diverse test sets, revealing their brittleness. OpenAI Clip, a pre-trained model, consistently performs well across various test sets. To address brittleness, a mitigation method involving extensive data augmentation during training and 23 transformed duplicates during test time, sustains accuracy.</p><p><strong>Conclusions: </strong>This paper proposes a melanoma classification scheme utilizing three dermoscopic structures across Ph2 and Derm7pt datasets. The study addresses AI susceptibility to image variations. 
Despite a small dataset, future work suggests collecting more annotated datasets and automatic computation of dermoscopic structural features.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1366312"},"PeriodicalIF":3.1,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10999676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140869541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-05 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1356116
Yuri Bogomolov, Alexander Belyi, Stanislav Sobolevsky
Introduction: Urban mobility patterns are crucial for effective urban and transportation planning. This study investigates the dynamics of urban mobility in Brno, Czech Republic, utilizing the rich dataset provided by passive mobile phone data. Understanding these patterns is essential for optimizing infrastructure and planning strategies.
Methods: We developed a methodological framework that incorporates bidirectional commute flows and integrates both urban and suburban commute networks. This comprehensive approach allows for a detailed representation of Brno's mobility landscape. By employing clustering techniques, we aimed to identify distinct mobility patterns within the city.
Results: Our analysis revealed consistent structural features within Brno's mobility patterns. We identified three distinct clusters: a central business district, residential communities, and an intermediate hybrid cluster. These clusters highlight the diversity of mobility demands across different parts of the city.
Discussion: The study demonstrates the significant potential of passive mobile phone data in enhancing our understanding of urban mobility patterns. The insights gained from intraday mobility data are invaluable for transportation planning decisions, allowing for the optimization of infrastructure utilization. The identification of distinct mobility patterns underscores the practical utility of our methodological advancements in informing more effective and efficient transportation planning strategies.
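A minimal sketch of the clustering step under stated assumptions: each city zone is described by hypothetical bidirectional commute-flow features, and k is fixed at the three clusters the Results report. The file and column names are invented for illustration:

```python
# Sketch: standardize zone-level commute-flow features and cluster zones
# into three groups (e.g., CBD / residential / hybrid).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

zones = pd.read_csv("brno_zones.csv")  # assumed zone-level table
features = ["inflow_morning", "outflow_morning",
            "inflow_evening", "outflow_evening"]  # assumed feature names

X = StandardScaler().fit_transform(zones[features])
zones["cluster"] = KMeans(n_clusters=3, n_init=10,
                          random_state=0).fit_predict(X)
print(zones.groupby("cluster")[features].mean())  # inspect cluster profiles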
{"title":"Urban delineation through a prism of intraday commute patterns.","authors":"Yuri Bogomolov, Alexander Belyi, Stanislav Sobolevsky","doi":"10.3389/fdata.2024.1356116","DOIUrl":"https://doi.org/10.3389/fdata.2024.1356116","url":null,"abstract":"<p><strong>Introduction: </strong>Urban mobility patterns are crucial for effective urban and transportation planning. This study investigates the dynamics of urban mobility in Brno, Czech Republic, utilizing the rich dataset provided by passive mobile phone data. Understanding these patterns is essential for optimizing infrastructure and planning strategies.</p><p><strong>Methods: </strong>We developed a methodological framework that incorporates bidirectional commute flows and integrates both urban and suburban commute networks. This comprehensive approach allows for a detailed representation of Brno's mobility landscape. By employing clustering techniques, we aimed to identify distinct mobility patterns within the city.</p><p><strong>Results: </strong>Our analysis revealed consistent structural features within Brno's mobility patterns. We identified three distinct clusters: a central business district, residential communities, and an intermediate hybrid cluster. These clusters highlight the diversity of mobility demands across different parts of the city.</p><p><strong>Discussion: </strong>The study demonstrates the significant potential of passive mobile phone data in enhancing our understanding of urban mobility patterns. The insights gained from intraday mobility data are invaluable for transportation planning decisions, allowing for the optimization of infrastructure utilization. The identification of distinct mobility patterns underscores the practical utility of our methodological advancements in informing more effective and efficient transportation planning strategies.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1356116"},"PeriodicalIF":3.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10948430/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140177714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-29 | DOI: 10.3389/fdata.2024.1291196
Qiu-Yan Yu, Ying Lin, Yu-Run Zhou, Xin-Jun Yang, Joris Hemelaar
We aimed to develop, train, and validate machine learning models for predicting preterm birth (<37 weeks' gestation) in singleton pregnancies at different gestational intervals. Models were developed based on complete data from 22,603 singleton pregnancies from a prospective population-based cohort study conducted in 51 midwifery clinics and hospitals in Wenzhou City, China, between 2014 and 2016. We applied CatBoost, Random Forest, Stacked Model, Deep Neural Network (DNN), and Support Vector Machine (SVM) algorithms, as well as logistic regression, to conduct feature selection and predictive modeling. Feature selection was implemented based on permutation-based feature importance lists derived from the machine learning models trained on all features, using a balanced training data set. To develop prediction models, the top 10%, 25%, and 50% most important predictive features were selected. Prediction models were developed with the training data set with 5-fold cross-validation for internal validation. Model performance was assessed using area under the receiver operating characteristic curve (AUC) values. The CatBoost-based prediction model after 26 weeks' gestation performed best, with an AUC value of 0.70 (0.67, 0.73), accuracy of 0.81, sensitivity of 0.47, and specificity of 0.83. The number of antenatal care visits before 24 weeks' gestation, aspartate aminotransferase level at registration, symphysis-fundal height, maternal weight, abdominal circumference, and blood pressure emerged as strong predictors after 26 completed weeks. The application of machine learning to pregnancy surveillance data is a promising approach to predicting preterm birth, and we identified several modifiable antenatal predictors.
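A hedged sketch of the feature-selection and internal-validation loop, run on synthetic stand-in data since the cohort data are not public; the class balance, iteration counts, and top-25% cut are assumptions:

```python
# Sketch: permutation-based feature importance from a CatBoost model,
# keep the top 25% of predictors, then 5-fold CV with AUC as the metric.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the cohort table (~7% positive class)
X, y = make_classification(n_samples=2000, n_features=40, weights=[0.93],
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(40)])

model = CatBoostClassifier(iterations=300, verbose=False).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top = X.columns[np.argsort(imp.importances_mean)[::-1][:10]]  # top 25%

auc = cross_val_score(CatBoostClassifier(iterations=300, verbose=False),
                      X[top], y, cv=5, scoring="roc_auc").mean()
print(f"Top features: {list(top)}\n5-fold CV AUC: {auc:.2f}")
```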
{"title":"Predicting risk of preterm birth in singleton pregnancies using machine learning algorithms.","authors":"Qiu-Yan Yu, Ying Lin, Yu-Run Zhou, Xin-Jun Yang, Joris Hemelaar","doi":"10.3389/fdata.2024.1291196","DOIUrl":"10.3389/fdata.2024.1291196","url":null,"abstract":"<p><p>We aimed to develop, train, and validate machine learning models for predicting preterm birth (<37 weeks' gestation) in singleton pregnancies at different gestational intervals. Models were developed based on complete data from 22,603 singleton pregnancies from a prospective population-based cohort study that was conducted in 51 midwifery clinics and hospitals in Wenzhou City of China between 2014 and 2016. We applied Catboost, Random Forest, Stacked Model, Deep Neural Networks (DNN), and Support Vector Machine (SVM) algorithms, as well as logistic regression, to conduct feature selection and predictive modeling. Feature selection was implemented based on permutation-based feature importance lists derived from the machine learning models including all features, using a balanced training data set. To develop prediction models, the top 10%, 25%, and 50% most important predictive features were selected. Prediction models were developed with the training data set with 5-fold cross-validation for internal validation. Model performance was assessed using area under the receiver operating curve (AUC) values. The CatBoost-based prediction model after 26 weeks' gestation performed best with an AUC value of 0.70 (0.67, 0.73), accuracy of 0.81, sensitivity of 0.47, and specificity of 0.83. Number of antenatal care visits before 24 weeks' gestation, aspartate aminotransferase level at registration, symphysis fundal height, maternal weight, abdominal circumference, and blood pressure emerged as strong predictors after 26 completed weeks. The application of machine learning on pregnancy surveillance data is a promising approach to predict preterm birth and we identified several modifiable antenatal predictors.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1291196"},"PeriodicalIF":3.1,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10941650/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140144558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-29 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1266031
Dmitry Kolobkov, Satyarth Mishra Sharma, Aleksandr Medvedev, Mikhail Lebedev, Egor Kosaretskiy, Ruslan Vakhitov
Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
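Federated averaging (FedAvg) is the typical mechanism behind such decentralized training and can be sketched in a few lines; the linear model and synthetic node splits below are stand-ins under stated assumptions, not the study's setup:

```python
# Toy FedAvg: each node fits locally from the global weights; only the
# weights travel to the server, which averages them by node sample count.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def fedavg_round(global_coef, nodes):
    """One communication round over (X, y) pairs held at separate nodes."""
    coefs, sizes = [], []
    for X, y in nodes:  # raw data never leaves a node
        clf = SGDClassifier(loss="log_loss", max_iter=5, tol=None)
        clf.fit(X, y, coef_init=global_coef.copy(),
                intercept_init=np.zeros(1))
        coefs.append(clf.coef_[0])
        sizes.append(len(y))
    w = np.asarray(sizes, float) / sum(sizes)
    return w @ np.vstack(coefs)  # sample-size-weighted average

# Three heterogeneous "nodes" as a stand-in for split genomic cohorts
nodes = [make_classification(n_samples=n, n_features=20, random_state=i)
         for i, n in enumerate([300, 500, 200])]
coef = np.zeros((1, 20))
for _ in range(10):  # ten communication rounds
    coef = fedavg_round(coef, nodes)[None, :]
print(np.round(coef[0, :5], 3))
```

The communication-frequency question the abstract raises corresponds here to how many local epochs run between averaging rounds.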
{"title":"Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project.","authors":"Dmitry Kolobkov, Satyarth Mishra Sharma, Aleksandr Medvedev, Mikhail Lebedev, Egor Kosaretskiy, Ruslan Vakhitov","doi":"10.3389/fdata.2024.1266031","DOIUrl":"10.3389/fdata.2024.1266031","url":null,"abstract":"<p><p>Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1266031"},"PeriodicalIF":3.1,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10937521/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140133172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-26 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1304439
Mathias Uta, Alexander Felfernig, Viet-Man Le, Thi Ngoc Trang Tran, Damian Garber, Sebastian Lubos, Tamim Burgstaller
Recommender systems are decision support systems that help users to identify items of relevance from a potentially large set of alternatives. In contrast to the mainstream recommendation approaches of collaborative filtering and content-based filtering, knowledge-based recommenders exploit semantic user preference knowledge, item knowledge, and recommendation knowledge to identify user-relevant items, which is of particular relevance when dealing with complex, high-involvement items. Such recommenders are primarily applied in scenarios where users specify (and revise) their preferences, and related recommendations are determined on the basis of constraints or attribute-level similarity metrics. In this article, we provide an overview of the existing state-of-the-art in knowledge-based recommender systems. Different related recommendation techniques are explained on the basis of a working example from the domain of survey software services. On the basis of our analysis, we outline different directions for future research.
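The constraint-plus-similarity pattern the article surveys can be shown with a toy example; the survey-software catalog, attributes, and preference fields below are invented for illustration, not taken from the article's working example:

```python
# Sketch of constraint-based recommendation: filter the catalog by hard
# constraints, then rank feasible items by attribute-level similarity.
catalog = [
    {"name": "SurveyLite", "price": 0,  "max_questions": 10,  "gdpr": True},
    {"name": "FormPro",    "price": 29, "max_questions": 200, "gdpr": True},
    {"name": "PollMax",    "price": 59, "max_questions": 500, "gdpr": False},
]
prefs = {"max_price": 40, "needs_gdpr": True, "ideal_questions": 150}

# Hard constraints: knowledge about which items are acceptable at all
feasible = [it for it in catalog
            if it["price"] <= prefs["max_price"]
            and (it["gdpr"] or not prefs["needs_gdpr"])]

# Attribute-level similarity: closer to the ideal question count is better
ranked = sorted(feasible,
                key=lambda it: abs(it["max_questions"]
                                   - prefs["ideal_questions"]))
print([it["name"] for it in ranked])  # -> ['FormPro', 'SurveyLite']
```

If the constraints rule out every item, a knowledge-based recommender would typically relax them and explain which preference caused the conflict, which is where the semantic recommendation knowledge comes in.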
{"title":"Knowledge-based recommender systems: overview and research directions.","authors":"Mathias Uta, Alexander Felfernig, Viet-Man Le, Thi Ngoc Trang Tran, Damian Garber, Sebastian Lubos, Tamim Burgstaller","doi":"10.3389/fdata.2024.1304439","DOIUrl":"10.3389/fdata.2024.1304439","url":null,"abstract":"<p><p>Recommender systems are decision support systems that help users to identify items of relevance from a potentially large set of alternatives. In contrast to the mainstream recommendation approaches of collaborative filtering and content-based filtering, knowledge-based recommenders exploit semantic user preference knowledge, item knowledge, and recommendation knowledge, to identify user-relevant items which is of specific relevance when dealing with complex and high-involvement items. Such recommenders are primarily applied in scenarios where users specify (and revise) their preferences, and related recommendations are determined on the basis of constraints or attribute-level similarity metrics. In this article, we provide an overview of the existing state-of-the-art in knowledge-based recommender systems. Different related recommendation techniques are explained on the basis of a working example from the domain of survey software services. On the basis of our analysis, we outline different directions for future research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1304439"},"PeriodicalIF":3.1,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10925703/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140102782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}