Ryo Imaizumi, Ryo Masumura, Sayaka Shiota, H. Kiya
End-to-end systems have demonstrated state-of-the-art performance on many tasks related to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we propose multi-task learning of Japanese DID and multi-dialect ASR (MD-ASR) with end-to-end models. Because Japanese dialects differ from one another both linguistically and acoustically, Japanese DID must consider linguistic and acoustic features simultaneously. One way to realize Japanese DID with both kinds of features is to use ASR transcriptions when performing DID. However, transcribing Japanese multi-dialect speech is regarded as a challenging ASR task because of the large linguistic and acoustic gaps between each dialect and standard Japanese. One remedy is dialect-aware ASR modeling, in which DID is performed together with ASR. We therefore propose a multi-task learning framework for Japanese DID and ASR that models the dependency between the two tasks. We explore three systems within the proposed framework, which differ in the order in which DID and ASR are performed. In the experiments, Japanese multi-dialect ASR and DID tests were conducted on our in-house Japanese multi-dialect database and a standard Japanese database. The proposed transformer-based systems outperformed conventional single-task systems on both the DID and ASR tests.
{"title":"End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning","authors":"Ryo Imaizumi, Ryo Masumura, Sayaka Shiota, H. Kiya","doi":"10.1561/116.00000045","DOIUrl":"https://doi.org/10.1561/116.00000045","url":null,"abstract":"End-to-end systems have demonstrated state-of-the-art performance on many tasks related to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we propose multi-task learning of Japanese DID and multi-dialect ASR (MD-ASR) systems with end-to-end models. Since Japanese dialects have variety in both linguistic and acoustic aspects of each dialect, Japanese DID requires simultaneously considering linguistic and acoustic features. One solution realizing Japanese DID using these features is to use transcriptions from ASR when performing DID. However, transcribing Japanese multi-dialect speech into text is regarded as a challenging task in ASR because there are big gaps in linguistic and acoustic features between a dialect and standard Japanese. One solution is dialect-aware ASR modeling, which means DID is performed with ASR. Therefore, the multi-task learning framework of Japanese DID and ASR is proposed to represent the dependency of them. We explore three systems as part of the proposed framework, changing the order in which DID and ASR are performed. In the experiments, Japanese multi-dialect ASR and DID tests were conducted on our home-made Japanese multi-dialect database and a standard Japanese database. 
The proposed transformer-based systems outperformed the conventional single task systems on both DID and ASR tests.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying Code Reading Strategies in Debugging using STA with a Tolerance Algorithm","authors":"Christine Lourrine S. Tablatin, M. M. Rodrigo","doi":"10.1561/116.00000040","DOIUrl":"https://doi.org/10.1561/116.00000040","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepFake and its Enabling Techniques: A Review","authors":"R. Brooks, Yefeng Yuan, Yuhong Liu, Haiquan Chen","doi":"10.1561/116.00000024","DOIUrl":"https://doi.org/10.1561/116.00000024","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67079629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. Kumwilaisak, Peerawat Pannattee, C. Hansakunbuntheung, N. Thatphithakkul
This paper proposes a novel method to improve the accuracy of American Sign Language fingerspelling recognition. Video sequences from the training set of the “ChicagoFSWild” dataset are first used to train a weakly supervised deep neural network that automatically generates frame labels from a sequence label. The weakly supervised network consists of AlexNet and an LSTM. The trained network produces a collection of frame-labeled images from those training video sequences for which the Levenshtein distance between the predicted sequence and the sequence label is zero. Positive and negative pairs of all fingerspelling gestures are formed at random from the collected image set. These pairs are used to train a Siamese network, built from ResNet-50 and a projection function, to produce efficient feature representations. The trained ResNet-50 and projection function are concatenated with a bidirectional LSTM, a fully connected layer, and a softmax layer to form a deep neural network for American Sign Language fingerspelling recognition. For the training video sequences, frames from sequences whose predicted sequence has zero Levenshtein distance to the sequence label are added to the collected image set. The updated image set is then used to train the Siamese network. This process, from training the Siamese network to updating the collected image set, is iterated until image recognition performance stops improving. Experimental results on the “ChicagoFSWild” dataset show that the proposed method surpasses existing work in terms of character error rate.
{"title":"American Sign Language Fingerspelling Recognition in the Wild with Iterative Language Model Construction","authors":"W. Kumwilaisak, Peerawat Pannattee, C. Hansakunbuntheung, N. Thatphithakkul","doi":"10.1561/116.00000003","DOIUrl":"https://doi.org/10.1561/116.00000003","url":null,"abstract":"This paper proposes a novel method to improve the accuracy of the American Sign Language fingerspelling recognition. Video sequences from the training set of the “ChicagoFSWild” dataset are first utilized for training a deep neural network of weakly supervised learning to generate frame labels from a sequence label automatically. The network of weakly supervised learning contains the AlexNet and the LSTM. This trained network generates a collection of frame-labeled images from the training video sequences that have Levenshtein distance between the predicted sequence and the sequence label equal to zero. The negative and positive pairs of all fingerspelling gestures are randomly formed from the collected image set. These pairs are adopted to train the Siamese network of the ResNet-50 and the projection function to produce efficient feature representations. The trained Resnet-50 and the projection function are concatenated with the bidirectional LSTM, a fully connected layer, and a softmax layer to form a deep neural network for the American Sign Language fingerspelling recognition. With the training video sequences, video frames corresponding to the video sequences that have Levenshtein distance between the predicted sequence and the sequence label equal to zero are added to the collected image set. The updated collected image set is used to train the Siamese network. The training process, from training the Siamese network to the update of the collected image set, is iterated until the image recognition performance is not further enhanced. 
The experimental results from the “ChicagoFSWild” dataset show that the proposed method surpasses the existing works in terms of the character error rate.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67079886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
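The fingerspelling abstract above relies on two quantities: the Levenshtein distance between a predicted letter sequence and its label, and the character error rate used for evaluation. The sketch below is only an illustration, not the authors' code. It assumes the common definition of character error rate as edit distance divided by reference length, and the function names (`levenshtein`, `character_error_rate`, `select_exact_matches`) are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted, reference):
    """Assumed definition: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(predicted, reference) / len(reference)


def select_exact_matches(predictions, labels):
    """Indices of sequences whose prediction has zero distance to its label."""
    return [i for i, (p, l) in enumerate(zip(predictions, labels))
            if levenshtein(p, l) == 0]
```

A Levenshtein distance of zero is equivalent to the prediction matching the label exactly, so the selection step keeps only sequences the network already recognizes without error.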
T. Küstner, Jiazhen Pan, Christopher Gilliam, H. Qi, G. Cruz, K. Hammernik, T. Blu, D. Rueckert, René M. Botnar, C. Prieto, S. Gatidis
Respiratory motion can cause artifacts in magnetic resonance imaging of the body trunk if patients cannot hold their breath or if triggered acquisitions are not practical. Retrospective correction strategies usually cope with motion by using fast imaging sequences under free-movement conditions, followed by motion binning based on motion traces. These acquisitions yield sub-Nyquist sampled, motion-resolved k-space data. Motion states are linked to each other by non-rigid deformation fields. Motion registration is usually formulated in image space, where it can be impaired by aliasing artifacts or by estimation from low-resolution images. Consequently, any motion-corrected reconstruction can be biased by errors in the deformation fields. In this work, we propose a deep-learning-based motion-corrected 4D (3D spatial + time) image reconstruction that combines a non-rigid registration network and a 4D reconstruction network. Non-rigid motion is estimated in k-space and incorporated into the reconstruction network. The proposed method is evaluated on in-vivo 4D motion-resolved magnetic resonance images of patients with suspected liver or lung metastases and of healthy subjects. The proposed approach provides 4D motion-corrected images and deformation fields. It enables a ∼14× accelerated acquisition with a 25-fold faster reconstruction than comparable approaches, while consistently preserving image quality across different patients and motion patterns.
{"title":"Self-Supervised Motion-Corrected Image Reconstruction Network for 4D Magnetic Resonance Imaging of the Body Trunk","authors":"T. Küstner, Jiazhen Pan, Christopher Gilliam, H. Qi, G. Cruz, K. Hammernik, T. Blu, D. Rueckert, René M. Botnar, C. Prieto, S. Gatidis","doi":"10.1561/116.00000039","DOIUrl":"https://doi.org/10.1561/116.00000039","url":null,"abstract":"Respiratory motion can cause artifacts in magnetic resonance imaging of the body trunk if patients cannot hold their breath or triggered acquisitions are not practical. Retrospective correction strategies usually cope with motion by fast imaging sequences under free-movement conditions followed by motion binning based on motion traces. These acquisitions yield sub-Nyquist sampled and motion-resolved k-space data. Motion states are linked to each other by non-rigid deformation fields. Usually, motion registration is formulated in image space which can however be impaired by aliasing artifacts or by estimation from low-resolution images. Subsequently, any motion-corrected reconstruction can be biased by errors in the deformation fields. In this work, we propose a deep-learning based motion-corrected 4D (3D spatial + time) image reconstruction which combines a non-rigid registration network and a 4D reconstruction network. Non-rigid motion is estimated in k-space and incorporated into the reconstruction network. The proposed method is evaluated on in-vivo 4D motion-resolved magnetic resonance images of patients with suspected liver or lung metastases and healthy subjects. The proposed approach provides 4D motion-corrected images and deformation fields. 
It enables a ∼ 14 × accelerated acquisition with a 25-fold faster reconstruction than comparable approaches under consistent preservation of image quality for changing patients and motion patterns.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Ling, C.-C. Jay Kuo, G. Sullivan, Dong Xu, Shan Liu, H. Hang, Wen-Hsiao Peng, Jiaying Liu
{"title":"The Future of Video Coding","authors":"N. Ling, C.-C. Jay Kuo, G. Sullivan, Dong Xu, Shan Liu, H. Hang, Wen-Hsiao Peng, Jiaying Liu","doi":"10.1561/116.00000044","DOIUrl":"https://doi.org/10.1561/116.00000044","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Barni, Y. Fang, Yuhong Liu, Laura Robinson, K. Sasahara, Subramaniam Vincent, Xinchao Wang, Zhizheng Wu
Recently, the viral propagation of mis/disinformation has raised significant concerns in both academia and industry. The problem is particularly difficult for two reasons. On the one hand, rapidly evolving technology makes it much cheaper and easier to manipulate and propagate information on social media. On the other hand, the complexity of human psychology and sociology makes it very difficult to understand, predict, and prevent users' involvement in mis/disinformation propagation. This themed series on "Multi-Disciplinary Dis/Misinformation Analysis and Countermeasures" aims to bring together the attention and efforts of researchers in relevant disciplines to tackle this challenging problem. In addition, on October 20, 2021, and March 7, 2022, some members of the guest editorial team organized two panel discussions: "Social Media Disinformation and its Impact on Public Health During the COVID-19 Pandemic" and "Dis/Misinformation Analysis and Countermeasures - A Computational Viewpoint." This article summarizes the key discussion items from these two panels and aims to shed light on future directions.
{"title":"Combating Misinformation/ Disinformation in Online Social Media: A Multidisciplinary View","authors":"M. Barni, Y. Fang, Yuhong Liu, Laura Robinson, K. Sasahara, Subramaniam Vincent, Xinchao Wang, Zhizheng Wu","doi":"10.1561/116.0000127","DOIUrl":"https://doi.org/10.1561/116.0000127","url":null,"abstract":"Recently, the viral propagation of mis/disinformation has raised significant concerns from both academia and industry. This problem is particularly difficult because on the one hand, rapidly evolving technology makes it much cheaper and easier to manipulate and propagate social media information. On the other hand, the complexity of human psychology and sociology makes the understanding, prediction and prevention of users' involvement in mis/disinformation propagation very difficult. This themed series on \"Multi-Disciplinary Dis/Misinformation Analysis and Countermeasures\" aims to bring the attention and efforts from researchers in relevant disciplines together to tackle this challenging problem. 
In addition, on October 20th, 2021, and March 7th 2022, some of the guest editorial team members organized two panel discussions on \"Social Media Disinformation and its Impact on Public Health During the COVID-19 Pandemic,\" and on \"Dis/Misinformation Analysis and Countermeasures - A Computational Viewpoint.\" This article summarizes the key discussion items at these two panels and hopes to shed light on the future directions.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UHP-SOT++: An Unsupervised Lightweight Single Object Tracker","authors":"Zhiruo Zhou, Hongyu Fu, Suya You, C. J. Kuo","doi":"10.1561/116.00000008","DOIUrl":"https://doi.org/10.1561/116.00000008","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67079945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingjing Meng, Xilin Chen, Jurgen Gall, Chang-Su Kim, Zicheng Liu, A. Piva, Junsong Yuan
{"title":"The Future of Computer Vision","authors":"Jingjing Meng, Xilin Chen, Jurgen Gall, Chang-Su Kim, Zicheng Liu, A. Piva, Junsong Yuan","doi":"10.1561/116.00000009","DOIUrl":"https://doi.org/10.1561/116.00000009","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67079954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning for Wireless Communication: An Overview","authors":"Zijian Cao, Huan Zhang, Le Liang, Geoffrey Ye Li","doi":"10.1561/116.00000029","DOIUrl":"https://doi.org/10.1561/116.00000029","url":null,"abstract":"","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67081216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}