End-to-End Recognition of Spontaneous Speech on the Hungarian BEA Database

Tímea Fekete; Péter Mihajlik

doi:10.15775/Besztud.2021.261-272

Tímea Fekete, BME Department of Telecommunications and Media Informatics
Péter Mihajlik, BME Department of Telecommunications and Media Informatics

Keywords: end-to-end speech recognition, deep neural networks, spontaneous speech, Hungarian

Abstract

The end-to-end approach to speech recognition, based on deep neural networks, is increasingly popular because it is fully data-driven: no language-specific knowledge is needed beyond the transcribed speech data. However, most end-to-end speech recognition experiments are performed on read (planned) speech, and no Hungarian-language results are available to the speech community. In this paper, we make the first attempt to train and evaluate a Hungarian speech recognition system in an end-to-end neural manner, based on the studio-quality Hungarian BEA (Spoken Language Speech Database). We present the challenge of recognising spontaneous speech: even without any significant background noise, the word error rate on spontaneous speech is an order of magnitude higher than on planned speech, although both were recorded with the same speakers in the same environment. This emphasises the need for more thorough studies of spontaneous speech and possibly for more data.
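
For readers unfamiliar with the metric, the word error rate mentioned above is conventionally the word-level edit distance between the reference transcript and the recogniser output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name `word_error_rate` and the toy strings are hypothetical, and text normalisation (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); assumes both strings are
    already normalised and whitespace-tokenised.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy strings, not taken from the BEA database.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.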