Basavaraj Hampiholi, Christian Jarvers, W. Mader, H. Neumann
Depthwise Separable Temporal Convolutional Network for Action Segmentation
2020 International Conference on 3D Vision (3DV), published 2020-11-01
DOI: 10.1109/3DV50981.2020.00073 (https://doi.org/10.1109/3DV50981.2020.00073)
Citation count: 2
Abstract
Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal-convolution-based approaches segment actions in videos using either an encoder-decoder (ED) architecture or dilations that double in consecutive convolution layers. However, ED networks operate at low temporal resolution, and the dilations in successive layers cause gridding artifacts. We propose a depthwise separable temporal convolutional network (DS-TCN) that operates at full temporal resolution with reduced gridding effects. The basic component of DS-TCN is the residual depthwise dilated block (RDDB). Using RDDB, we explore the trade-off between large kernels and small dilation rates. We show that DS-TCN efficiently captures both long-term dependencies and local temporal cues. Our evaluation on three benchmark datasets (GTEA, 50Salads, and Breakfast) demonstrates that DS-TCN outperforms the existing ED-TCN and dilation-based TCN baselines even with comparatively fewer parameters.
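
The trade-off between large kernels and small dilation rates, and the parameter savings of depthwise separable convolutions, can be illustrated with some simple arithmetic. The sketch below is not the authors' implementation: the function names, the channel width of 64, and the kernel and dilation settings are hypothetical values chosen only for illustration. It assumes stride-1 1D convolutions and ignores bias terms.

```python
def standard_conv_params(c_in, c_out, kernel_size):
    """Weight count of a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Weight count of a depthwise conv (one filter per input channel)
    followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated convolutions.
    Each layer adds (kernel_size - 1) * dilation frames."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical settings, not taken from the paper.
channels = 64

# A larger kernel is cheap when the convolution is depthwise separable.
small_std = standard_conv_params(channels, channels, 3)        # 12288
large_std = standard_conv_params(channels, channels, 15)       # 61440
large_dws = depthwise_separable_params(channels, channels, 15)  # 5056

# Small kernel with doubling dilations vs. large kernel with small dilations.
rf_doubling = receptive_field(3, [1, 2, 4, 8])  # 31 frames over four layers
rf_large_kernel = receptive_field(15, [1, 2])   # 43 frames over two layers
```

Under these assumed settings, a depthwise separable layer with kernel size 15 has fewer weights than a standard layer with kernel size 3, and two such layers with dilations 1 and 2 cover more frames than four kernel-size-3 layers with doubling dilations. Whether this matches the configurations evaluated in the paper is not stated in the abstract.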