Latent space improved masked reconstruction model for human skeleton-based action recognition

Human skeleton-based action recognition is an important task in computer vision. In recent years, the masked autoencoder (MAE) has been used in various fields due to its powerful self-supervised learning ability and has achieved good results in masked data reconstruction tasks. However, in...

Bibliographic Details
Main Authors: Enqing Chen, Xueting Wang, Xin Guo, Ying Zhu, Dong Li
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-02-01
Series: Frontiers in Neurorobotics
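Subjects: human skeleton-based action recognition; variational autoencoder; vector quantized variational autoencoder; masked reconstruction model; self-supervised learning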
Online Access: https://www.frontiersin.org/articles/10.3389/fnbot.2025.1482281/full
author Enqing Chen
Xueting Wang
Xin Guo
Ying Zhu
Dong Li
collection DOAJ
description Human skeleton-based action recognition is an important task in computer vision. In recent years, the masked autoencoder (MAE) has been used in various fields due to its powerful self-supervised learning ability and has achieved good results in masked data reconstruction tasks. However, in visual classification tasks such as action recognition, the limited ability of the encoder in the autoencoder structure to learn features results in poor classification performance. We propose to enhance the encoder's feature extraction ability in classification tasks by leveraging the latent space of the variational autoencoder (VAE), and further replace it with the latent space of the vector quantized variational autoencoder (VQVAE). The resulting models are called SkeletonMVAE and SkeletonMVQVAE, respectively. In SkeletonMVAE, we constrain the latent variables to represent features in the form of distributions. In SkeletonMVQVAE, we discretize the latent variables. Both constraints help the encoder learn deeper data structures and more discriminative and generalized feature representations. Experimental results on the NTU-60 and NTU-120 datasets demonstrate that the proposed method effectively improves the encoder's classification accuracy and its generalization ability when few labeled data are available. SkeletonMVAE exhibits stronger classification ability, while SkeletonMVQVAE exhibits stronger generalization when fewer labeled data are available.
format Article
id doaj-art-65f47d78ba614d8294971e0d8ef11b1f
institution Kabale University
issn 1662-5218
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Neurorobotics
spelling doaj-art-65f47d78ba614d8294971e0d8ef11b1f
2025-02-12T07:26:45Z
eng
Frontiers Media S.A.
Frontiers in Neurorobotics
1662-5218
2025-02-01
19
10.3389/fnbot.2024.1482281
1482281
Latent space improved masked reconstruction model for human skeleton-based action recognition
Enqing Chen 0
Xueting Wang 1
Xin Guo 2
Ying Zhu 3
Dong Li 4
School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, China
School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, China
School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, China
State Grid Henan Electric Power Company Information and Communication Branch, Zhengzhou, China
State Grid Henan Electric Power Company Information and Communication Branch, Zhengzhou, China
https://www.frontiersin.org/articles/10.3389/fnbot.2025.1482281/full
human skeleton-based action recognition
variational autoencoder
vector quantized variational autoencoder
masked reconstruction model
self-supervised learning
title Latent space improved masked reconstruction model for human skeleton-based action recognition
topic human skeleton-based action recognition
variational autoencoder
vector quantized variational autoencoder
masked reconstruction model
self-supervised learning
url https://www.frontiersin.org/articles/10.3389/fnbot.2025.1482281/full
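
The description in this record names two latent-space constraints: representing features as distributions (VAE) and discretizing the latent variables (VQVAE). The sketch below illustrates the textbook form of each mechanism in plain Python. It is not the authors' SkeletonMVAE or SkeletonMVQVAE code, which this record does not include. All function names, the two-dimensional latent vector, and the three-entry codebook are assumptions chosen only for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    This is the standard VAE reparameterization trick; log_var holds
    the log-variance of each latent dimension.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z.

    Nearest is measured by squared Euclidean distance, as in the
    standard VQ-VAE lookup step.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return index, codebook[index]


# Hypothetical encoder outputs for one token (assumed values).
rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]

# Continuous, distribution-shaped latent (VAE-style).
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# Discrete latent (VQ-VAE-style) using a hypothetical 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]
index, z_q = quantize(mu, codebook)
```

In a trained model the KL term would be added to the reconstruction loss, and the quantization step would be paired with codebook and commitment losses; both are omitted here to keep the sketch short.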