A speech recognition method with enhanced transformer decoder
Main Authors:
Format: Article
Language: English
Published: SpringerOpen, 2025-02-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Online Access: https://doi.org/10.1186/s13636-025-00394-6
Summary: To address the Transformer decoder's difficulty in capturing the local features needed for monotonic alignment in speech recognition, and to incorporate language model cold fusion training into the decoder, an enhanced decoder-based speech recognition model is investigated. The enhanced decoder separates the two attention mechanisms of the Transformer decoder and recombines them into cross-attention layers and a self-attention language model module. The cross-attention layers capture local features from the encoder output more efficiently, while the self-attention language model module is pre-trained on additional domain-related text and then trained with cold fusion. Experimental results on the Mandarin Aishell-1 dataset show that, with a Conformer encoder, the enhanced decoder achieves a 16.1% reduction in character error rate compared with the Transformer decoder. Furthermore, when the language model is pre-trained on suitable text data, the performance of the cold fusion-trained model improves further.
ISSN: 1687-4722
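The cold fusion step mentioned in the summary can be sketched for a single decoding step: the decoder state and the pre-trained language model's hidden state are combined through a learned sigmoid gate, and the gated LM state is concatenated with the decoder state before the output projection. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; all dimensions and weight names (`W_g`, `b_g`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8  # hypothetical model dimension

# Hypothetical per-step states: decoder output and pre-trained LM hidden state
s_dec = rng.standard_normal(d)  # decoder (cross-attention) state
h_lm = rng.standard_normal(d)   # self-attention language model state

# Hypothetical learned fusion-gate parameters
W_g = rng.standard_normal((d, 2 * d))
b_g = np.zeros(d)

# Cold fusion: gate computed from both states, applied to the LM state,
# then the gated LM state is concatenated with the decoder state
g = sigmoid(W_g @ np.concatenate([s_dec, h_lm]) + b_g)
fused = np.concatenate([s_dec, g * h_lm])  # input to the output projection

print(fused.shape)  # (16,)
```

Because the gate is conditioned on the decoder state, the model can learn during training how much to trust the language model at each step, which is the distinguishing property of cold fusion over shallow fusion applied only at inference time.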