Dimitris Kastaniotis, Dimitrios Tsourounis, S. Fotopoulos
2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), published 2020-10-17
DOI: https://doi.org/10.1109/CISP-BMEI51763.2020.9263634
Lip Reading modeling with Temporal Convolutional Networks for medical support applications
Automated Lip Reading (LR) is the task of predicting a spoken word using only the visual information in a sequence of frames. This sequence modeling task has been approached with Convolutional Neural Networks (CNNs) combined with Long Short-Term Memory (LSTM) networks. This work presents a novel scheme for modeling LR sequences in which Temporal Convolutional Networks (TCNs) are driven by the feature vectors produced by a CNN. The contribution is two-fold. First, the TCN topology is used as an alternative way to handle the sequential data of the LR task. Second, the approach is evaluated on a new, challenging real-world dataset designed specifically for LR of Greek words related to biomedical and clinical conditions. The words in the dataset were selected as those a patient would want to communicate, using the frontal camera of a mobile phone, while receiving medical treatment. Experimental results indicate that the proposed CNN-TCN architecture can surpass recurrent approaches based on CNN-LSTM, while also providing major benefits for deployment in model hardware architectures and more stability during training.
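The idea of running temporal convolutions over per-frame CNN feature vectors can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give kernel sizes, dilation rates, or layer counts, so the causal padding, the single shared kernel, the ReLU with residual connection, the average pooling over time, and the names `dilated_conv1d`, `receptive_field`, and `tcn_classify` are all assumptions made for illustration. The per-frame feature vectors are taken as given, standing in for the CNN output.

```python
from typing import List

Sequence = List[List[float]]  # T frames, each a feature vector of length D


def dilated_conv1d(seq: Sequence, kernel: List[float], dilation: int) -> Sequence:
    """Causal dilated 1D convolution over time, applied to each channel.

    Assumption: one kernel shared by all channels; a trained TCN would
    learn separate filters. Frames before the sequence start count as zero.
    """
    num_frames, dim = len(seq), len(seq[0])
    k_size = len(kernel)
    out = []
    for t in range(num_frames):
        frame = [0.0] * dim
        for k in range(k_size):
            src = t - (k_size - 1 - k) * dilation  # look back in steps of `dilation`
            if src >= 0:
                for d in range(dim):
                    frame[d] += kernel[k] * seq[src][d]
        out.append(frame)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input frames that influence one output frame of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def tcn_classify(features: Sequence, kernel: List[float],
                 dilations: List[int], class_weights: List[List[float]]) -> int:
    """Stack dilated convolutions, pool over time, and pick a word class."""
    h = features
    for dilation in dilations:
        conv = dilated_conv1d(h, kernel, dilation)
        # Assumed block design: ReLU on the convolution output plus a residual connection.
        h = [[x + max(0.0, c) for x, c in zip(fx, fc)] for fx, fc in zip(h, conv)]
    num_frames = len(h)
    pooled = [sum(frame[d] for frame in h) / num_frames for d in range(len(h[0]))]
    scores = [sum(w * p for w, p in zip(ws, pooled)) for ws in class_weights]
    return scores.index(max(scores))


# Toy input: 4 frames with 1-dimensional features (stand-ins for CNN outputs).
frames = [[1.0], [2.0], [3.0], [4.0]]
print(dilated_conv1d(frames, [1.0, 1.0], dilation=2))
print(receptive_field(3, [1, 2, 4]))
print(tcn_classify(frames, [1.0, 1.0], [1], [[-1.0], [1.0]]))
```

Unlike an LSTM, every output frame of such a stack is computed from a fixed window of input frames, so all time steps can be processed in parallel. Under these assumptions the window grows with the sum of the dilation rates, which is what `receptive_field` computes.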