{"title":"多模态机器学习用于精神健康领域的语言和语音标记识别。","authors":"Georgios Drougkas, Erwin M Bakker, Marco Spruit","doi":"10.1186/s12911-024-02772-0","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>There are numerous papers focusing on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that the majority of these studies either use unimodal approaches to diagnose a variety of mental disorders or employ multimodal approaches to diagnose a single mental disorder instead. In this research we combine these approaches by first identifying and compiling an extensive list of mental health disorder markers for a wide range of mental illnesses which have been used for both unimodal and multimodal methods, which is subsequently used for determining whether the multimodal approach can outperform the unimodal approaches.</p><p><strong>Methods: </strong>For this study we used the well known and robust multimodal DAIC-WOZ dataset derived from clinical interviews. Here we focus on the modalities text and audio. First, we constructed two unimodal models to analyze text and audio data, respectively, using feature extraction, based on the extensive list of mental disorder markers that has been identified and compiled by us using related and earlier studies. For our unimodal text model, we also propose an initial pragmatic binary label creation process. Then, we employed an early fusion strategy to combine our text and audio features before model processing. Our fused feature set was then given as input to various baseline machine and deep learning algorithms, including Support Vector Machines, Logistic Regressions, Random Forests, and fully connected neural network classifiers (Dense Layers). 
Ultimately, the performance of our models was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.</p><p><strong>Results: </strong>Overall, the unimodal text models achieved an accuracy ranging from 78% to 87% and an AUC-ROC score between 85% and 93%, while the unimodal audio models attained an accuracy of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved comparable accuracy (ranging from 80% to 87%) and AUC-ROC scores (between 84% and 93%) to those of the unimodal text models. However, the majority of the multimodal models managed to outperform the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well the models perform in identifying the presence of a marker.</p><p><strong>Conclusions: </strong>In conclusion, by refining the binary label creation process and by improving the feature engineering process of the unimodal acoustic model, we argue that the multimodal model can outperform both unimodal approaches. 
This study underscores the importance of multimodal integration in the field of mental health diagnostics and sets the stage for future research to explore more sophisticated fusion techniques and deeper learning models.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"24 1","pages":"354"},"PeriodicalIF":3.3000,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal machine learning for language and speech markers identification in mental health.\",\"authors\":\"Georgios Drougkas, Erwin M Bakker, Marco Spruit\",\"doi\":\"10.1186/s12911-024-02772-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>There are numerous papers focusing on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that the majority of these studies either use unimodal approaches to diagnose a variety of mental disorders or employ multimodal approaches to diagnose a single mental disorder instead. In this research we combine these approaches by first identifying and compiling an extensive list of mental health disorder markers for a wide range of mental illnesses which have been used for both unimodal and multimodal methods, which is subsequently used for determining whether the multimodal approach can outperform the unimodal approaches.</p><p><strong>Methods: </strong>For this study we used the well known and robust multimodal DAIC-WOZ dataset derived from clinical interviews. Here we focus on the modalities text and audio. First, we constructed two unimodal models to analyze text and audio data, respectively, using feature extraction, based on the extensive list of mental disorder markers that has been identified and compiled by us using related and earlier studies. 
For our unimodal text model, we also propose an initial pragmatic binary label creation process. Then, we employed an early fusion strategy to combine our text and audio features before model processing. Our fused feature set was then given as input to various baseline machine and deep learning algorithms, including Support Vector Machines, Logistic Regressions, Random Forests, and fully connected neural network classifiers (Dense Layers). Ultimately, the performance of our models was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.</p><p><strong>Results: </strong>Overall, the unimodal text models achieved an accuracy ranging from 78% to 87% and an AUC-ROC score between 85% and 93%, while the unimodal audio models attained an accuracy of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved comparable accuracy (ranging from 80% to 87%) and AUC-ROC scores (between 84% and 93%) to those of the unimodal text models. However, the majority of the multimodal models managed to outperform the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well the models perform in identifying the presence of a marker.</p><p><strong>Conclusions: </strong>In conclusion, by refining the binary label creation process and by improving the feature engineering process of the unimodal acoustic model, we argue that the multimodal model can outperform both unimodal approaches. 
This study underscores the importance of multimodal integration in the field of mental health diagnostics and sets the stage for future research to explore more sophisticated fusion techniques and deeper learning models.</p>\",\"PeriodicalId\":9340,\"journal\":{\"name\":\"BMC Medical Informatics and Decision Making\",\"volume\":\"24 1\",\"pages\":\"354\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-11-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Informatics and Decision Making\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12911-024-02772-0\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-024-02772-0","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Multimodal machine learning for language and speech markers identification in mental health.
Background: Numerous papers focus on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that the majority of these studies either use unimodal approaches to diagnose a variety of mental disorders or use multimodal approaches to diagnose a single mental disorder. In this research we combine these approaches: we first identify and compile an extensive list of mental health disorder markers, covering a wide range of mental illnesses and drawn from both unimodal and multimodal studies, and then use this list to determine whether a multimodal approach can outperform unimodal approaches.
Methods: For this study we used the well-known and robust multimodal DAIC-WOZ dataset, derived from clinical interviews, focusing on the text and audio modalities. First, we constructed two unimodal models to analyze text and audio data, respectively, extracting features based on the extensive list of mental disorder markers that we compiled from related and earlier studies. For our unimodal text model, we also propose an initial pragmatic binary label creation process. We then employed an early fusion strategy to combine the text and audio features before model processing. The fused feature set was given as input to various baseline machine learning and deep learning algorithms, including Support Vector Machines, Logistic Regression, Random Forests, and fully connected neural network classifiers (dense layers). Finally, model performance was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.
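The early fusion strategy described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the feature matrices, their dimensions, and the labels are hypothetical random stand-ins for the text- and audio-marker features, and scikit-learn supplies the baseline classifiers named in the abstract (a dense neural network would be added analogously, e.g. with an MLP classifier).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical per-interview feature matrices: in the study these would come
# from the text- and audio-marker extraction steps; here they are random stand-ins.
n_samples = 200
X_text = rng.normal(size=(n_samples, 32))   # e.g. linguistic-marker features
X_audio = rng.normal(size=(n_samples, 16))  # e.g. acoustic-marker features
y = rng.integers(0, 2, size=n_samples)      # binary marker label (1 = present)

# Early fusion: concatenate the modality features before any model sees them.
X_fused = np.concatenate([X_text, X_audio], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X_fused, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline classifiers named in the abstract, trained on the fused features.
baselines = {
    "svm": SVC(probability=True, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```

The key design point is that fusion happens at the feature level (early fusion), so every downstream classifier sees a single joint representation rather than separate per-modality predictions.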
Results: Overall, the unimodal text models achieved accuracies ranging from 78% to 87% and AUC-ROC scores between 85% and 93%, while the unimodal audio models attained accuracies of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved accuracy (80% to 87%) and AUC-ROC scores (84% to 93%) comparable to those of the unimodal text models. However, the majority of the multimodal models outperformed the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well a model identifies the presence of a marker.
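The four reported metrics (accuracy, AUC-ROC, and the per-class F1 scores) can be computed with scikit-learn as below. The labels and scores are illustrative toy values, not results from the study:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and model outputs, standing in for one
# marker classifier evaluated on a held-out set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.25]

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)            # needs scores/probabilities, not hard labels
f1_pos = f1_score(y_true, y_pred, pos_label=1)  # "F1 of 1s": marker present
f1_neg = f1_score(y_true, y_pred, pos_label=0)  # "F1 of 0s": marker absent

print(f"accuracy={accuracy:.2f} auc={auc:.2f} f1_pos={f1_pos:.2f} f1_neg={f1_neg:.2f}")
# → accuracy=0.80 auc=0.96 f1_pos=0.80 f1_neg=0.80
```

Reporting both per-class F1 scores matters here because accuracy and AUC-ROC can mask weak performance on the positive (marker-present) class when the classes are imbalanced.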
Conclusions: By refining the binary label creation process and improving the feature engineering of the unimodal acoustic model, we argue that the multimodal model can outperform both unimodal approaches. This study underscores the importance of multimodal integration in mental health diagnostics and sets the stage for future research into more sophisticated fusion techniques and deeper learning models.
Journal introduction:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.