TY - JOUR
T1 - Transformer Models and Convolutional Networks with Different Activation Functions for Swallow Classification Using Depth Video Data
AU - Lai, Derek Ka Hei
AU - Cheng, Ethan Shiu Wang
AU - So, Bryan Pak Hei
AU - Mao, Ye Jiao
AU - Cheung, Sophia Ming Yan
AU - Cheung, Daphne Sze Ki
AU - Wong, Duo Wai Chi
AU - Cheung, James Chung Wai
N1 - Funding Information:
This study was supported by the Health and Medical Research Fund from the Health Bureau, Hong Kong (reference number: 19200461), and by an internal fund from the Research Institute for Smart Ageing, The Hong Kong Polytechnic University.
Publisher Copyright:
© 2023 by the authors.
PY - 2023/7/12
Y1 - 2023/7/12
N2 - Dysphagia is a common geriatric syndrome that can lead to serious complications and death. Standard diagnostics using the Videofluoroscopic Swallowing Study (VFSS) or Fiberoptic Endoscopic Evaluation of Swallowing (FEES) are expensive and expose patients to risks, while bedside screening is subjective and might lack reliability. An affordable and accessible instrumented screening method is needed. This study aimed to evaluate the classification performance of Transformer models and convolutional networks in identifying swallowing and non-swallowing tasks from depth video data. Different activation functions (ReLU, LeakyReLU, GELU, ELU, SiLU, and GLU) were then evaluated on the best-performing model. Sixty-five healthy participants were invited to perform swallowing tasks (eating a cracker and drinking water) and non-swallowing tasks (taking a deep breath and pronouncing the vowels “/eɪ/”, “/iː/”, “/aɪ/”, “/oʊ/”, and “/uː/”). Swallowing and non-swallowing tasks were classified using Transformer models (TimeSFormer and Video Vision Transformer (ViViT)) and convolutional neural networks (SlowFast, X3D, and R(2+1)D). In general, the convolutional neural networks outperformed the Transformer models. X3D was the best model, with good-to-excellent performance (F1-score: 0.920; adjusted F1-score: 0.885) in classifying swallowing and non-swallowing conditions. Moreover, X3D with its default activation function (ReLU) produced the best results, although LeakyReLU performed better in the deep breathing and “/aɪ/” pronunciation tasks. Future studies should consider collecting more data for pretraining and developing a hyperparameter tuning strategy for activation functions and for the high-dimensional video data for Transformer models.
AB - Dysphagia is a common geriatric syndrome that can lead to serious complications and death. Standard diagnostics using the Videofluoroscopic Swallowing Study (VFSS) or Fiberoptic Endoscopic Evaluation of Swallowing (FEES) are expensive and expose patients to risks, while bedside screening is subjective and might lack reliability. An affordable and accessible instrumented screening method is needed. This study aimed to evaluate the classification performance of Transformer models and convolutional networks in identifying swallowing and non-swallowing tasks from depth video data. Different activation functions (ReLU, LeakyReLU, GELU, ELU, SiLU, and GLU) were then evaluated on the best-performing model. Sixty-five healthy participants were invited to perform swallowing tasks (eating a cracker and drinking water) and non-swallowing tasks (taking a deep breath and pronouncing the vowels “/eɪ/”, “/iː/”, “/aɪ/”, “/oʊ/”, and “/uː/”). Swallowing and non-swallowing tasks were classified using Transformer models (TimeSFormer and Video Vision Transformer (ViViT)) and convolutional neural networks (SlowFast, X3D, and R(2+1)D). In general, the convolutional neural networks outperformed the Transformer models. X3D was the best model, with good-to-excellent performance (F1-score: 0.920; adjusted F1-score: 0.885) in classifying swallowing and non-swallowing conditions. Moreover, X3D with its default activation function (ReLU) produced the best results, although LeakyReLU performed better in the deep breathing and “/aɪ/” pronunciation tasks. Future studies should consider collecting more data for pretraining and developing a hyperparameter tuning strategy for activation functions and for the high-dimensional video data for Transformer models.
KW - aspiration pneumonia
KW - computer-aided screening
KW - deep learning
KW - dysphagia
KW - gerontechnology
UR - http://www.scopus.com/inward/record.url?scp=85166230111&partnerID=8YFLogxK
U2 - 10.3390/math11143081
DO - 10.3390/math11143081
M3 - Journal article
AN - SCOPUS:85166230111
SN - 2227-7390
VL - 11
JO - Mathematics
JF - Mathematics
IS - 14
M1 - 3081
ER -