Implementation of articulatory speech synthesis based on dynamic ultrasound recordings
Abstract
Starting from 2D dynamic ultrasound recordings that capture the movement of the vocal organs simultaneously and in synchrony with the speaker's speech signal, we produce machine speech by means of a neural network. As visual objects, we use tongue and palate contours fitted automatically to the anatomical boundaries in the ultrasound images, and for training we extract geometric information from these contours, since the change of their shape fundamentally describes the movement of the vocal organs during articulation. The geometric data consist of radial distances between the tongue and palate contours and of discrete cosine transform coefficients of the curves. Relying on this dataset, the network is trained to predict parameters connected to the acoustic content of the speech signal. These parameters can be interpreted in the framework of the acoustic tube model of the vocal tract; accordingly, the training targets are the reflection coefficients and cross-sectional areas of the articulation channel. In this study, sentences are synthesised using linear predictive coding and the acoustic tube model.
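As an illustration of the geometric features named above, the following sketch computes radial tongue-palate distances and low-order DCT coefficients for a single contour frame. It is not taken from the study: the probe-centred geometry, the function names and all numeric values are assumptions made only for the example.

```python
# Illustrative sketch of the two feature types mentioned in the abstract:
# radial tongue-palate distances and DCT coefficients of a contour.
# Geometry, names and values are assumptions, not the paper's implementation.
import numpy as np
from scipy.fft import dct


def contour_dct(contour_xy, n_coeff=10):
    """Low-order DCT-II coefficients of the contour height profile,
    used as a compact, fixed-length shape descriptor."""
    y = np.asarray(contour_xy, dtype=float)[:, 1]
    return dct(y, type=2, norm="ortho")[:n_coeff]


def radial_distances(tongue_xy, palate_xy, probe_origin, n_rays=16):
    """Tongue-to-palate distance sampled along rays fanning out from the
    probe origin: both contours are expressed in polar form about the
    origin and interpolated onto a shared set of angles."""
    def to_polar(xy):
        d = np.asarray(xy, dtype=float) - np.asarray(probe_origin, dtype=float)
        theta = np.arctan2(d[:, 1], d[:, 0])
        radius = np.hypot(d[:, 0], d[:, 1])
        order = np.argsort(theta)
        return theta[order], radius[order]

    th_t, r_t = to_polar(tongue_xy)
    th_p, r_p = to_polar(palate_xy)
    # Only angles covered by both contours are meaningful.
    angles = np.linspace(max(th_t[0], th_p[0]), min(th_t[-1], th_p[-1]), n_rays)
    return np.interp(angles, th_p, r_p) - np.interp(angles, th_t, r_t)


if __name__ == "__main__":
    # Toy contours: a flat palate above a slightly arched tongue.
    x = np.linspace(-3.0, 3.0, 50)
    tongue = np.column_stack([x, 4.0 + 0.5 * np.cos(x)])
    palate = np.column_stack([x, 6.0 * np.ones_like(x)])
    features = np.concatenate([
        radial_distances(tongue, palate, probe_origin=(0.0, 0.0)),
        contour_dct(tongue),
    ])
    print(features.shape)  # one frame's geometric feature vector
```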
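The tube-model parameters can be related to LPC synthesis as in the following minimal sketch, which assumes the standard lossless acoustic tube / LPC correspondence rather than the study's actual pipeline: reflection coefficients map to the tube's area function and, via the step-up (Levinson) recursion, to an all-pole filter driven here by a crude impulse-train excitation. The coefficient values are placeholders, and sign conventions for the reflection coefficients vary between texts.

```python
# Sketch of the lossless-tube / LPC correspondence, with placeholder
# reflection coefficients and a stand-in glottal source; not the study's code.
import numpy as np
from scipy.signal import lfilter


def reflection_to_areas(k, lip_area=1.0):
    """Area function A[0..N] of an N-section lossless tube, assuming
    k[i] = (A[i+1] - A[i]) / (A[i+1] + A[i]) and a fixed area at the lips."""
    areas = np.empty(len(k) + 1)
    areas[-1] = lip_area
    for i in range(len(k) - 1, -1, -1):
        areas[i] = areas[i + 1] * (1.0 - k[i]) / (1.0 + k[i])
    return areas


def reflection_to_lpc(k):
    """Step-up (Levinson) recursion: reflection coefficients -> polynomial
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p of the all-pole synthesis filter 1/A(z)."""
    a = np.zeros(0)
    for km in k:
        a = np.concatenate([a + km * a[::-1], [km]])
    return np.concatenate([[1.0], a])


if __name__ == "__main__":
    # Placeholder reflection coefficients for one analysis frame (|k| < 1).
    k = np.array([0.30, -0.20, 0.45, -0.10, 0.25, -0.40, 0.15, -0.30])
    print("tube areas:", np.round(reflection_to_areas(k), 3))

    fs = 22050                       # sampling rate in Hz
    excitation = np.zeros(fs // 10)  # one 100 ms frame
    excitation[:: fs // 120] = 1.0   # ~120 Hz impulse train as a crude glottal source
    frame = lfilter([1.0], reflection_to_lpc(k), excitation)
    print("synthesised samples:", frame.shape)
```

Because every reflection coefficient has magnitude below one, the resulting all-pole filter is stable, which is what makes this parameterisation convenient as a network output.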