Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1379921
Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit
Social media has profoundly changed our modes of self-expression, communication, and participation in public discourse, generating volumes of conversations and content that cover every aspect of our social lives. Social media platforms have thus become increasingly important as data sources to identify social trends and phenomena. In recent years, academics have steadily lost ground on access to social media data as technology companies have set more restrictions on Application Programming Interfaces (APIs) or entirely closed public APIs. This circumstance halts the work of many social scientists who have used such data to study issues of public good. We considered the viability of eight approaches for image-based social media data collection: data philanthropy organizations, data repositories, data donation, third-party data companies, homegrown tools, and various web scraping tools and scripts. This paper discusses the advantages and challenges of these approaches from literature and from the authors' experience. We conclude the paper by discussing mechanisms for improving social media data collection that will enable this future frontier of social science research.
{"title":"From theory to practice: insights and hurdles in collecting social media data for social science research.","authors":"Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit","doi":"10.3389/fdata.2024.1379921","DOIUrl":"10.3389/fdata.2024.1379921","url":null,"abstract":"<p><p>Social media has profoundly changed our modes of self-expression, communication, and participation in public discourse, generating volumes of conversations and content that cover every aspect of our social lives. Social media platforms have thus become increasingly important as data sources to identify social trends and phenomena. In recent years, academics have steadily lost ground on access to social media data as technology companies have set more restrictions on Application Programming Interfaces (APIs) or entirely closed public APIs. This circumstance halts the work of many social scientists who have used such data to study issues of public good. We considered the viability of eight approaches for image-based social media data collection: data philanthropy organizations, data repositories, data donation, third-party data companies, homegrown tools, and various web scraping tools and scripts. This paper discusses the advantages and challenges of these approaches from literature and from the authors' experience. We conclude the paper by discussing mechanisms for improving social media data collection that will enable this future frontier of social science research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1379921"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169574/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141319551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1412837
Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe
Introduction: Air quality is directly affected by pollutant emissions from vehicles, especially in large cities and metropolitan areas or where compliance with vehicle emission standards is not enforced. Particulate Matter (PM) is one of the pollutants emitted by fuel burning in internal combustion engines; it remains suspended in the atmosphere, causing respiratory and cardiovascular health problems in the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5.
Methods: Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline.
Results: RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables.
Discussion: The RF and LSTM models improved prediction and forecasting relative to the MLR and naive baselines, respectively. The LSTM was trained on data from the COVID-19 pandemic period (2020 and 2021) and was able to forecast PM2.5 concentrations in 2022, a year in which the data show greater vehicle circulation and higher PM2.5 peaks. Our results support the physical understanding of the factors influencing pollutant dispersion from vehicle emissions in the lower atmosphere of urban environments. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.
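The prediction setup described in the Methods can be approximated in a few lines. A minimal sketch, assuming a hypothetical file and column names for the hourly averages (the study's actual data and pipeline are not public):

```python
# Hedged sketch of the RF prediction step: fit on meteorological and
# traffic features, then rank predictors by permutation importance, as
# the Results report (PBL height first). File/column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("curitiba_hourly.csv")  # assumed hourly-averaged table
features = ["pbl_height", "wind_gust", "wind_velocity", "temperature",
            "humidity", "vehicle_flow"]  # assumed predictor names
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["pm25"], test_size=0.2, shuffle=False)  # keep time order

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Rank predictors on held-out data (predictions only, no refitting)
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```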
{"title":"Particulate matter forecast and prediction in Curitiba using machine learning.","authors":"Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe","doi":"10.3389/fdata.2024.1412837","DOIUrl":"10.3389/fdata.2024.1412837","url":null,"abstract":"<p><strong>Introduction: </strong>Air quality is directly affected by pollutant emission from vehicles, especially in large cities and metropolitan areas or when there is no compliance check for vehicle emission standards. Particulate Matter (PM) is one of the pollutants emitted from fuel burning in internal combustion engines and remains suspended in the atmosphere, causing respiratory and cardiovascular health problems to the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5.</p><p><strong>Methods: </strong>Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline.</p><p><strong>Results: </strong>RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables.</p><p><strong>Discussion: </strong>The RF and LSTM models were able to improve prediction and forecasting compared with MLR and Naive, respectively. The LSTM was trained with data corresponding to the period of the COVID-19 pandemic (2020 and 2021) and was able to forecast the concentration of PM2.5 in 2022, in which the data show that there was greater circulation of vehicles and higher peaks in the concentration of PM2.5. Our results can help the physical understanding of factors influencing pollutant dispersion from vehicle emissions at the lower atmosphere in urban environment. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1412837"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169811/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1330392
Jinghui Chen, Takayuki Mizuno, Shohei Doi
Traditional monolingual word embedding models transform words into high-dimensional vectors that represent semantic relations between words as relationships between vectors in the high-dimensional space. They serve as productive tools for interpreting multifarious aspects of the social world in social science research. Building on previous research that interprets multifaceted meanings of words by projecting them onto word-level dimensions defined by differences between antonyms, we extend the architecture of word-level cultural dimensions to the sentence level and adopt a Language-agnostic BERT Sentence Embedding (LaBSE) model to detect position similarities in a multi-language environment. We assess the efficacy of our sentence-level methodology using Twitter data from US politicians, comparing it to the traditional word-level embedding model. We also adopt Latent Dirichlet Allocation (LDA) to investigate detailed topics in these tweets and interpret politicians' positions from different angles. In addition, we adopt Twitter data from Spanish politicians and visualize their positions in a multi-language space to analyze position similarities across countries. The results show that our sentence-level methodology outperforms the traditional word-level model. We also demonstrate that our methodology is effective in dealing with finely sorted themes, as political positions toward different topics vary even within the same politician. Through verification using American and Spanish political datasets, we find that the positioning of American and Spanish politicians on our defined liberal-conservative axis aligns with social common sense, political news, and previous research. Our architecture improves on the standard word-level methodology and can serve as a useful architecture for sentence-level applications in the future.
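The core projection idea is easy to sketch. Below is a hedged illustration using the public sentence-transformers LaBSE checkpoint; the two pole statements defining the liberal-conservative axis and the sample tweets are invented placeholders, not the authors' choices:

```python
# Sketch: embed sentences with LaBSE and project them onto an axis defined
# by the difference between two "pole" statements. Poles are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

liberal_pole = "We must expand public healthcare and protect voting rights."
conservative_pole = "We must cut taxes and defend traditional values."
axis = model.encode(conservative_pole) - model.encode(liberal_pole)
axis /= np.linalg.norm(axis)  # unit-length axis in the embedding space

tweets = ["Healthcare is a human right.",
          "La bajada de impuestos beneficia a las familias."]  # multi-language
positions = model.encode(tweets) @ axis  # signed position along the axis
print(dict(zip(tweets, positions.round(3))))
```

Because LaBSE maps all languages into one space, English and Spanish texts land on the same axis, which is what makes the cross-country comparison possible.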
{"title":"Analyzing political party positions through multi-language twitter text embeddings.","authors":"Jinghui Chen, Takayuki Mizuno, Shohei Doi","doi":"10.3389/fdata.2024.1330392","DOIUrl":"10.3389/fdata.2024.1330392","url":null,"abstract":"<p><p>Traditional monolingual word embedding models transform words into high-dimensional vectors which represent semantics relations between words as relationships between vectors in the high-dimensional space. They serve as productive tools to interpret multifarious aspects of the social world in social science research. Building on the previous research which interprets multifaceted meanings of words by projecting them onto word-level dimensions defined by differences between antonyms, we extend the architecture of establishing word-level cultural dimensions to the sentence level and adopt a Language-agnostic BERT model (LaBSE) to detect position similarities in a multi-language environment. We assess the efficacy of our sentence-level methodology using Twitter data from US politicians, comparing it to the traditional word-level embedding model. We also adopt Latent Dirichlet Allocation (LDA) to investigate detailed topics in these tweets and interpret politicians' positions from different angles. In addition, we adopt Twitter data from Spanish politicians and visualize their positions in a multi-language space to analyze position similarities across countries. The results show that our sentence-level methodology outperform traditional word-level model. We also demonstrate that our methodology is effective dealing with fine-sorted themes from the result that political positions towards different topics vary even within the same politicians. Through verification using American and Spanish political datasets, we find that the positioning of American and Spanish politicians on our defined liberal-conservative axis aligns with social common sense, political news, and previous research. Our architecture improves the standard word-level methodology and can be considered as a useful architecture for sentence-level applications in the future.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1330392"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169868/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-02 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1366415
Muhammad Aasem, Muhammad Javed Iqbal
Chest X-ray (CXR) imaging is widely employed by radiologists to diagnose thoracic diseases. Recently, many deep learning techniques have been proposed as computer-aided diagnostic (CAD) tools to assist radiologists in minimizing the risk of incorrect diagnosis. From an application perspective, these models have exhibited two major challenges: (1) they require large volumes of annotated data at the training stage and (2) they lack explainable factors to justify their outcomes at the prediction stage. In the present study, we developed a class activation mapping (CAM)-based ensemble model, called Ensemble-CAM, to address both of these challenges via weakly supervised learning by employing explainable AI (XAI) functions. Ensemble-CAM utilizes class labels to predict the location of disease in association with interpretable features. The proposed work leverages ensemble and transfer learning with class activation functions to achieve three objectives: (1) minimizing the dependency on strongly annotated data when locating thoracic diseases, (2) enhancing confidence in predicted outcomes by visualizing their interpretable features, and (3) optimizing cumulative performance via fusion functions. Ensemble-CAM was trained on three CXR image datasets and evaluated through qualitative and quantitative measures via heatmaps and Jaccard indices. The results reflect enhanced performance and reliability in comparison to existing standalone and ensemble models.
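The quantitative evaluation mentioned above can be illustrated independently of the model itself. A minimal sketch of scoring a class activation heatmap against an annotated disease region with the Jaccard index; the 0.5 threshold and the random stand-in arrays are assumptions, not the paper's protocol:

```python
# Sketch: binarize a CAM heatmap relative to its peak activation and
# compute the Jaccard index (IoU) against a ground-truth mask.
import numpy as np

def jaccard_from_heatmap(heatmap: np.ndarray, gt_mask: np.ndarray,
                         thresh: float = 0.5) -> float:
    """IoU between the thresholded activation map and the annotated region."""
    pred = heatmap >= thresh * heatmap.max()  # binarize relative to peak
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return float(inter / union) if union else 1.0

cam = np.random.rand(224, 224)        # stand-in for an Ensemble-CAM heatmap
mask = np.zeros((224, 224), dtype=bool)
mask[80:140, 90:160] = True           # stand-in ground-truth bounding region
print(f"Jaccard index: {jaccard_from_heatmap(cam, mask):.3f}")
```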
{"title":"Toward explainable AI in radiology: Ensemble-CAM for effective thoracic disease localization in chest X-ray images using weak supervised learning.","authors":"Muhammad Aasem, Muhammad Javed Iqbal","doi":"10.3389/fdata.2024.1366415","DOIUrl":"https://doi.org/10.3389/fdata.2024.1366415","url":null,"abstract":"<p><p>Chest X-ray (CXR) imaging is widely employed by radiologists to diagnose thoracic diseases. Recently, many deep learning techniques have been proposed as computer-aided diagnostic (CAD) tools to assist radiologists in minimizing the risk of incorrect diagnosis. From an application perspective, these models have exhibited two major challenges: (1) They require large volumes of annotated data at the training stage and (2) They lack explainable factors to justify their outcomes at the prediction stage. In the present study, we developed a class activation mapping (CAM)-based ensemble model, called Ensemble-CAM, to address both of these challenges via weakly supervised learning by employing explainable AI (XAI) functions. Ensemble-CAM utilizes class labels to predict the location of disease in association with interpretable features. The proposed work leverages ensemble and transfer learning with class activation functions to achieve three objectives: (1) minimizing the dependency on strongly annotated data when locating thoracic diseases, (2) enhancing confidence in predicted outcomes by visualizing their interpretable features, and (3) optimizing cumulative performance via fusion functions. Ensemble-CAM was trained on three CXR image datasets and evaluated through qualitative and quantitative measures via heatmaps and Jaccard indices. The results reflect the enhanced performance and reliability in comparison to existing standalone and ensembled models.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1366415"},"PeriodicalIF":3.1,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11096460/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140960924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1396638
Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla
[This corrects the article DOI: 10.3389/fdata.2023.1086212.].
{"title":"Corrigendum: A community focused approach toward making healthy and affordable daily diet recommendations.","authors":"Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla","doi":"10.3389/fdata.2024.1396638","DOIUrl":"https://doi.org/10.3389/fdata.2024.1396638","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.3389/fdata.2023.1086212.].</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1396638"},"PeriodicalIF":3.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11024675/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140870346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-25 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1366312
Fiza Saeed Malik, Muhammad Haroon Yousaf, Hassan Ahmed Sial, Serestina Viriri
Background: Melanoma is one of the deadliest skin cancers; it originates from melanocytes in which sun exposure causes mutations. Early detection boosts the cure rate to 90%, but misclassification drops survival to 15-20%. Clinical variation challenges dermatologists in distinguishing benign nevi from melanomas. Current diagnostic methods, including visual analysis and dermoscopy, have limitations, emphasizing the need for Artificial Intelligence in dermatology.
Objectives: In this paper, we aim to explore dermoscopic structures for the classification of melanoma lesions. The training of AI models faces a challenge known as brittleness, where small changes in input images impact the classification. A study explored AI vulnerability in discerning melanoma from benign lesions using features of size, color, and shape. Tests with artificial and natural variations revealed a notable decline in accuracy, emphasizing the necessity for additional information, such as dermoscopic structures.
Methodology: The study utilizes datasets with clinically marked dermoscopic images examined by expert clinicians. Transformers and CNN-based models are employed to classify these images based on dermoscopic structures. Classification results are validated using feature visualization. To assess model susceptibility to image variations, classifiers are evaluated on test sets with original, duplicated, and digitally modified images. Additionally, testing is done on ISIC 2016 images. The study focuses on three dermoscopic structures crucial for melanoma detection: Blue-white veil, dots/globules, and streaks.
Results: In evaluating model performance, adding convolutions to Vision Transformers proves highly effective, achieving up to 98% accuracy. CNN architectures like VGG-16 and DenseNet-121 reach 50-60% accuracy, performing best with features other than dermoscopic structures. Vision Transformers without convolutions exhibit reduced accuracy on diverse test sets, revealing their brittleness. OpenAI CLIP, a pre-trained model, consistently performs well across the various test sets. To address brittleness, a mitigation method involving extensive data augmentation during training and 23 transformed duplicates at test time sustains accuracy.
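The test-time side of that mitigation can be sketched generically. Assuming a PyTorch image classifier; only the count of 23 duplicates comes from the abstract, the specific transforms are illustrative:

```python
# Sketch of test-time augmentation: average a classifier's probabilities
# over the original image plus transformed duplicates.
import torch
import torchvision.transforms as T

tta = T.Compose([T.RandomHorizontalFlip(), T.RandomRotation(15),
                 T.ColorJitter(brightness=0.2, contrast=0.2)])

@torch.no_grad()
def tta_predict(model, image: torch.Tensor, n_copies: int = 23):
    """Mean softmax over image (a CxHxW float tensor) and n duplicates."""
    batch = torch.stack([image] + [tta(image) for _ in range(n_copies)])
    return model(batch).softmax(dim=1).mean(dim=0)
```

Averaging over the duplicates trades extra inference cost for stability on perturbed inputs, which is the brittleness the paper targets.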
Conclusions: This paper proposes a melanoma classification scheme utilizing three dermoscopic structures across the Ph2 and Derm7pt datasets and addresses AI susceptibility to image variations. Given the small dataset, future work includes collecting more annotated datasets and automatically computing dermoscopic structural features.
{"title":"Exploring dermoscopic structures for melanoma lesions' classification.","authors":"Fiza Saeed Malik, Muhammad Haroon Yousaf, Hassan Ahmed Sial, Serestina Viriri","doi":"10.3389/fdata.2024.1366312","DOIUrl":"https://doi.org/10.3389/fdata.2024.1366312","url":null,"abstract":"<p><strong>Background: </strong>Melanoma is one of the deadliest skin cancers that originate from melanocytes due to sun exposure, causing mutations. Early detection boosts the cure rate to 90%, but misclassification drops survival to 15-20%. Clinical variations challenge dermatologists in distinguishing benign nevi and melanomas. Current diagnostic methods, including visual analysis and dermoscopy, have limitations, emphasizing the need for Artificial Intelligence understanding in dermatology.</p><p><strong>Objectives: </strong>In this paper, we aim to explore dermoscopic structures for the classification of melanoma lesions. The training of AI models faces a challenge known as brittleness, where small changes in input images impact the classification. A study explored AI vulnerability in discerning melanoma from benign lesions using features of size, color, and shape. Tests with artificial and natural variations revealed a notable decline in accuracy, emphasizing the necessity for additional information, such as dermoscopic structures.</p><p><strong>Methodology: </strong>The study utilizes datasets with clinically marked dermoscopic images examined by expert clinicians. Transformers and CNN-based models are employed to classify these images based on dermoscopic structures. Classification results are validated using feature visualization. To assess model susceptibility to image variations, classifiers are evaluated on test sets with original, duplicated, and digitally modified images. Additionally, testing is done on ISIC 2016 images. The study focuses on three dermoscopic structures crucial for melanoma detection: Blue-white veil, dots/globules, and streaks.</p><p><strong>Results: </strong>In evaluating model performance, adding convolutions to Vision Transformers proves highly effective for achieving up to 98% accuracy. CNN architectures like VGG-16 and DenseNet-121 reach 50-60% accuracy, performing best with features other than dermoscopic structures. Vision Transformers without convolutions exhibit reduced accuracy on diverse test sets, revealing their brittleness. OpenAI Clip, a pre-trained model, consistently performs well across various test sets. To address brittleness, a mitigation method involving extensive data augmentation during training and 23 transformed duplicates during test time, sustains accuracy.</p><p><strong>Conclusions: </strong>This paper proposes a melanoma classification scheme utilizing three dermoscopic structures across Ph2 and Derm7pt datasets. The study addresses AI susceptibility to image variations. 
Despite a small dataset, future work suggests collecting more annotated datasets and automatic computation of dermoscopic structural features.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1366312"},"PeriodicalIF":3.1,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10999676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140869541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-05 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1356116
Yuri Bogomolov, Alexander Belyi, Stanislav Sobolevsky
Introduction: Urban mobility patterns are crucial for effective urban and transportation planning. This study investigates the dynamics of urban mobility in Brno, Czech Republic, utilizing the rich dataset provided by passive mobile phone data. Understanding these patterns is essential for optimizing infrastructure and planning strategies.
Methods: We developed a methodological framework that incorporates bidirectional commute flows and integrates both urban and suburban commute networks. This comprehensive approach allows for a detailed representation of Brno's mobility landscape. By employing clustering techniques, we aimed to identify distinct mobility patterns within the city.
Results: Our analysis revealed consistent structural features within Brno's mobility patterns. We identified three distinct clusters: a central business district, residential communities, and an intermediate hybrid cluster. These clusters highlight the diversity of mobility demands across different parts of the city.
Discussion: The study demonstrates the significant potential of passive mobile phone data in enhancing our understanding of urban mobility patterns. The insights gained from intraday mobility data are invaluable for transportation planning decisions, allowing for the optimization of infrastructure utilization. The identification of distinct mobility patterns underscores the practical utility of our methodological advancements in informing more effective and efficient transportation planning strategies.
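A minimal sketch of the clustering step under stated assumptions: each city zone is described by hypothetical bidirectional commute-flow features, and k is fixed at the three clusters the Results report. The file and column names are invented for illustration:

```python
# Sketch: standardize zone-level commute-flow features and cluster zones
# into three groups (e.g., CBD / residential / hybrid).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

zones = pd.read_csv("brno_zones.csv")  # assumed zone-level table
features = ["inflow_morning", "outflow_morning",
            "inflow_evening", "outflow_evening"]  # assumed feature names

X = StandardScaler().fit_transform(zones[features])
zones["cluster"] = KMeans(n_clusters=3, n_init=10,
                          random_state=0).fit_predict(X)
print(zones.groupby("cluster")[features].mean())  # inspect cluster profiles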
{"title":"Urban delineation through a prism of intraday commute patterns.","authors":"Yuri Bogomolov, Alexander Belyi, Stanislav Sobolevsky","doi":"10.3389/fdata.2024.1356116","DOIUrl":"https://doi.org/10.3389/fdata.2024.1356116","url":null,"abstract":"<p><strong>Introduction: </strong>Urban mobility patterns are crucial for effective urban and transportation planning. This study investigates the dynamics of urban mobility in Brno, Czech Republic, utilizing the rich dataset provided by passive mobile phone data. Understanding these patterns is essential for optimizing infrastructure and planning strategies.</p><p><strong>Methods: </strong>We developed a methodological framework that incorporates bidirectional commute flows and integrates both urban and suburban commute networks. This comprehensive approach allows for a detailed representation of Brno's mobility landscape. By employing clustering techniques, we aimed to identify distinct mobility patterns within the city.</p><p><strong>Results: </strong>Our analysis revealed consistent structural features within Brno's mobility patterns. We identified three distinct clusters: a central business district, residential communities, and an intermediate hybrid cluster. These clusters highlight the diversity of mobility demands across different parts of the city.</p><p><strong>Discussion: </strong>The study demonstrates the significant potential of passive mobile phone data in enhancing our understanding of urban mobility patterns. The insights gained from intraday mobility data are invaluable for transportation planning decisions, allowing for the optimization of infrastructure utilization. The identification of distinct mobility patterns underscores the practical utility of our methodological advancements in informing more effective and efficient transportation planning strategies.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1356116"},"PeriodicalIF":3.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10948430/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140177714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-29 | DOI: 10.3389/fdata.2024.1291196
Qiu-Yan Yu, Ying Lin, Yu-Run Zhou, Xin-Jun Yang, Joris Hemelaar
We aimed to develop, train, and validate machine learning models for predicting preterm birth (<37 weeks' gestation) in singleton pregnancies at different gestational intervals. Models were developed based on complete data from 22,603 singleton pregnancies from a prospective population-based cohort study conducted in 51 midwifery clinics and hospitals in Wenzhou City, China, between 2014 and 2016. We applied CatBoost, Random Forest, Stacked Model, Deep Neural Network (DNN), and Support Vector Machine (SVM) algorithms, as well as logistic regression, to conduct feature selection and predictive modeling. Feature selection was implemented based on permutation-based feature importance lists derived from the machine learning models trained on all features, using a balanced training data set. To develop prediction models, the top 10%, 25%, and 50% most important predictive features were selected. Prediction models were developed with the training data set with 5-fold cross-validation for internal validation. Model performance was assessed using area under the receiver operating characteristic curve (AUC) values. The CatBoost-based prediction model after 26 weeks' gestation performed best, with an AUC value of 0.70 (0.67, 0.73), accuracy of 0.81, sensitivity of 0.47, and specificity of 0.83. The number of antenatal care visits before 24 weeks' gestation, aspartate aminotransferase level at registration, symphysis-fundal height, maternal weight, abdominal circumference, and blood pressure emerged as strong predictors after 26 completed weeks. The application of machine learning to pregnancy surveillance data is a promising approach to predicting preterm birth, and we identified several modifiable antenatal predictors.
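A hedged sketch of the feature-selection and internal-validation loop, run on synthetic stand-in data since the cohort data are not public; the class balance, iteration counts, and top-25% cut are assumptions:

```python
# Sketch: permutation-based feature importance from a CatBoost model,
# keep the top 25% of predictors, then 5-fold CV with AUC as the metric.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the cohort table (~7% positive class)
X, y = make_classification(n_samples=2000, n_features=40, weights=[0.93],
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(40)])

model = CatBoostClassifier(iterations=300, verbose=False).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top = X.columns[np.argsort(imp.importances_mean)[::-1][:10]]  # top 25%

auc = cross_val_score(CatBoostClassifier(iterations=300, verbose=False),
                      X[top], y, cv=5, scoring="roc_auc").mean()
print(f"Top features: {list(top)}\n5-fold CV AUC: {auc:.2f}")
```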
{"title":"Predicting risk of preterm birth in singleton pregnancies using machine learning algorithms.","authors":"Qiu-Yan Yu, Ying Lin, Yu-Run Zhou, Xin-Jun Yang, Joris Hemelaar","doi":"10.3389/fdata.2024.1291196","DOIUrl":"10.3389/fdata.2024.1291196","url":null,"abstract":"<p><p>We aimed to develop, train, and validate machine learning models for predicting preterm birth (<37 weeks' gestation) in singleton pregnancies at different gestational intervals. Models were developed based on complete data from 22,603 singleton pregnancies from a prospective population-based cohort study that was conducted in 51 midwifery clinics and hospitals in Wenzhou City of China between 2014 and 2016. We applied Catboost, Random Forest, Stacked Model, Deep Neural Networks (DNN), and Support Vector Machine (SVM) algorithms, as well as logistic regression, to conduct feature selection and predictive modeling. Feature selection was implemented based on permutation-based feature importance lists derived from the machine learning models including all features, using a balanced training data set. To develop prediction models, the top 10%, 25%, and 50% most important predictive features were selected. Prediction models were developed with the training data set with 5-fold cross-validation for internal validation. Model performance was assessed using area under the receiver operating curve (AUC) values. The CatBoost-based prediction model after 26 weeks' gestation performed best with an AUC value of 0.70 (0.67, 0.73), accuracy of 0.81, sensitivity of 0.47, and specificity of 0.83. Number of antenatal care visits before 24 weeks' gestation, aspartate aminotransferase level at registration, symphysis fundal height, maternal weight, abdominal circumference, and blood pressure emerged as strong predictors after 26 completed weeks. The application of machine learning on pregnancy surveillance data is a promising approach to predict preterm birth and we identified several modifiable antenatal predictors.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1291196"},"PeriodicalIF":3.1,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10941650/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140144558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-29 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1266031
Dmitry Kolobkov, Satyarth Mishra Sharma, Aleksandr Medvedev, Mikhail Lebedev, Egor Kosaretskiy, Ruslan Vakhitov
Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
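Federated averaging (FedAvg) is the typical mechanism behind such decentralized training and can be sketched in a few lines; the linear model and synthetic node splits below are stand-ins under stated assumptions, not the study's setup:

```python
# Toy FedAvg: each node fits locally from the global weights; only the
# weights travel to the server, which averages them by node sample count.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def fedavg_round(global_coef, nodes):
    """One communication round over (X, y) pairs held at separate nodes."""
    coefs, sizes = [], []
    for X, y in nodes:  # raw data never leaves a node
        clf = SGDClassifier(loss="log_loss", max_iter=5, tol=None)
        clf.fit(X, y, coef_init=global_coef.copy(),
                intercept_init=np.zeros(1))
        coefs.append(clf.coef_[0])
        sizes.append(len(y))
    w = np.asarray(sizes, float) / sum(sizes)
    return w @ np.vstack(coefs)  # sample-size-weighted average

# Three heterogeneous "nodes" as a stand-in for split genomic cohorts
nodes = [make_classification(n_samples=n, n_features=20, random_state=i)
         for i, n in enumerate([300, 500, 200])]
coef = np.zeros((1, 20))
for _ in range(10):  # ten communication rounds
    coef = fedavg_round(coef, nodes)[None, :]
print(np.round(coef[0, :5], 3))
```

The communication-frequency question the abstract raises corresponds here to how many local epochs run between averaging rounds.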
{"title":"Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project.","authors":"Dmitry Kolobkov, Satyarth Mishra Sharma, Aleksandr Medvedev, Mikhail Lebedev, Egor Kosaretskiy, Ruslan Vakhitov","doi":"10.3389/fdata.2024.1266031","DOIUrl":"10.3389/fdata.2024.1266031","url":null,"abstract":"<p><p>Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1266031"},"PeriodicalIF":3.1,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10937521/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140133172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-26 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1304439
Mathias Uta, Alexander Felfernig, Viet-Man Le, Thi Ngoc Trang Tran, Damian Garber, Sebastian Lubos, Tamim Burgstaller
Recommender systems are decision support systems that help users to identify items of relevance from a potentially large set of alternatives. In contrast to the mainstream recommendation approaches of collaborative filtering and content-based filtering, knowledge-based recommenders exploit semantic user preference knowledge, item knowledge, and recommendation knowledge to identify user-relevant items, which is of particular relevance when dealing with complex, high-involvement items. Such recommenders are primarily applied in scenarios where users specify (and revise) their preferences, and related recommendations are determined on the basis of constraints or attribute-level similarity metrics. In this article, we provide an overview of the existing state-of-the-art in knowledge-based recommender systems. Different related recommendation techniques are explained on the basis of a working example from the domain of survey software services. On the basis of our analysis, we outline different directions for future research.
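The constraint-plus-similarity pattern the article surveys can be shown with a toy example; the survey-software catalog, attributes, and preference fields below are invented for illustration, not taken from the article's working example:

```python
# Sketch of constraint-based recommendation: filter the catalog by hard
# constraints, then rank feasible items by attribute-level similarity.
catalog = [
    {"name": "SurveyLite", "price": 0,  "max_questions": 10,  "gdpr": True},
    {"name": "FormPro",    "price": 29, "max_questions": 200, "gdpr": True},
    {"name": "PollMax",    "price": 59, "max_questions": 500, "gdpr": False},
]
prefs = {"max_price": 40, "needs_gdpr": True, "ideal_questions": 150}

# Hard constraints: knowledge about which items are acceptable at all
feasible = [it for it in catalog
            if it["price"] <= prefs["max_price"]
            and (it["gdpr"] or not prefs["needs_gdpr"])]

# Attribute-level similarity: closer to the ideal question count is better
ranked = sorted(feasible,
                key=lambda it: abs(it["max_questions"]
                                   - prefs["ideal_questions"]))
print([it["name"] for it in ranked])  # -> ['FormPro', 'SurveyLite']
```

If the constraints rule out every item, a knowledge-based recommender would typically relax them and explain which preference caused the conflict, which is where the semantic recommendation knowledge comes in.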
{"title":"Knowledge-based recommender systems: overview and research directions.","authors":"Mathias Uta, Alexander Felfernig, Viet-Man Le, Thi Ngoc Trang Tran, Damian Garber, Sebastian Lubos, Tamim Burgstaller","doi":"10.3389/fdata.2024.1304439","DOIUrl":"10.3389/fdata.2024.1304439","url":null,"abstract":"<p><p>Recommender systems are decision support systems that help users to identify items of relevance from a potentially large set of alternatives. In contrast to the mainstream recommendation approaches of collaborative filtering and content-based filtering, knowledge-based recommenders exploit semantic user preference knowledge, item knowledge, and recommendation knowledge, to identify user-relevant items which is of specific relevance when dealing with complex and high-involvement items. Such recommenders are primarily applied in scenarios where users specify (and revise) their preferences, and related recommendations are determined on the basis of constraints or attribute-level similarity metrics. In this article, we provide an overview of the existing state-of-the-art in knowledge-based recommender systems. Different related recommendation techniques are explained on the basis of a working example from the domain of survey software services. On the basis of our analysis, we outline different directions for future research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1304439"},"PeriodicalIF":3.1,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10925703/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140102782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}