A speech recognition method with enhanced transformer decoder
Abstract To address the Transformer decoder's difficulty in capturing local features for monotonic alignment in speech recognition, and to incorporate language-model cold-fusion training into the decoder, an enhanced decoder-based speech recognition model is investigated. The...
Saved in:
Main Authors: | Hengbo Hu, Tong Niu, Zhenhua He |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2025-02-01 |
Series: | EURASIP Journal on Audio, Speech, and Music Processing |
Subjects: | Cross-attention; Transformer decoder; Language model cold fusion |
Online Access: | https://doi.org/10.1186/s13636-025-00394-6 |
---|---|
author | Hengbo Hu Tong Niu Zhenhua He |
collection | DOAJ |
description | Abstract To address the Transformer decoder's difficulty in capturing local features for monotonic alignment in speech recognition, and to incorporate language-model cold-fusion training into the decoder, an enhanced decoder-based speech recognition model is investigated. The enhanced decoder separates the two attention mechanisms of the Transformer decoder into cross-attention layers and a self-attention language-model module. The cross-attention layers capture local features from the encoder output more efficiently, while the self-attention language-model module is pre-trained on additional domain-related text and then trained with cold fusion. Experimental results on the Mandarin Aishell-1 dataset demonstrate that when the encoder is a Conformer, the enhanced decoder achieves a 16.1% reduction in character error rate compared to the Transformer decoder. Furthermore, when the language model is pre-trained with suitable text data, the performance of the cold-fusion-trained model is further enhanced. |
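The cold-fusion step described in the abstract — gating a pre-trained language model's hidden states and fusing them with the decoder's states — can be sketched as follows. This is a minimal NumPy illustration of the general cold-fusion gating idea, not the authors' implementation; the function and variable names (`cold_fusion`, `W_g`, `b_g`, the chosen dimensions) are all hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cold_fusion(dec_state, lm_state, W_g, b_g):
    """Gate the (frozen) pre-trained LM hidden states with a learned
    sigmoid gate, then concatenate them with the decoder states."""
    # gate depends on both the decoder and LM states
    g = sigmoid(np.concatenate([dec_state, lm_state], axis=-1) @ W_g + b_g)
    # fused representation: decoder state alongside the gated LM state
    return np.concatenate([dec_state, g * lm_state], axis=-1)

rng = np.random.default_rng(0)
T, d = 5, 8                          # sequence length, hidden size (illustrative)
dec_state = rng.normal(size=(T, d))  # decoder hidden states
lm_state = rng.normal(size=(T, d))   # pre-trained LM hidden states
W_g = rng.normal(size=(2 * d, d))    # gate projection weights
b_g = np.zeros(d)                    # gate bias

fused = cold_fusion(dec_state, lm_state, W_g, b_g)
print(fused.shape)  # (5, 16)
```

Because the gate output lies in (0, 1), the LM contribution is attenuated element-wise rather than overwritten, which is the property cold fusion relies on when the LM is later swapped or fine-tuned.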
format | Article |
id | doaj-art-5b6fd2d632584e43854bd517d0fc49eb |
institution | Kabale University |
issn | 1687-4722 |
language | English |
publishDate | 2025-02-01 |
publisher | SpringerOpen |
record_format | Article |
series | EURASIP Journal on Audio, Speech, and Music Processing |
spelling | Hengbo Hu (Research and Development Department 1 - Intelligent Speech Technology Team, Zhengzhou Xinda Institute of Advanced Technology); Tong Niu (School of Information Systems Engineering, University of Information Engineering); Zhenhua He (Research and Development Department 1 - Intelligent Speech Technology Team, Zhengzhou Xinda Institute of Advanced Technology). EURASIP Journal on Audio, Speech, and Music Processing, vol. 2025, no. 1, pp. 1-12, 2025-02-01. https://doi.org/10.1186/s13636-025-00394-6 |
title | A speech recognition method with enhanced transformer decoder |
topic | Cross-attention Transformer decoder Language model cold fusion |
url | https://doi.org/10.1186/s13636-025-00394-6 |