by Robert Vích
A Text-to-Speech (TTS) system has been developed in a collaboration between the Speech Processing Laboratory of the Institute of Radio Engineering and Electronics, Academy of Sciences of the Czech Republic, and the Institute of Phonetics of the Faculty of Arts, Charles University.
The system has been produced since 1993 by the Cooperative of the Blind (SPEKTRA) in Prague, mainly as a reading machine for the blind. The TTS system is based on the concatenation of elementary speech units representing diphones and subphonemic elements of natural male/female speech. The inventory consists of 441 speech units, from which all sounds of Czech and Slovak can be synthesized. The speech units are described by linear predictive coding (LPC) combined with vector quantization. The synthesis is carried out by a modified vocal tract model. For voiced excitation, a periodic multipulse unit-sample response of a broad-band Hilbert transformer is used; unvoiced segments are excited by coloured noise.
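The source-filter idea behind LPC synthesis can be illustrated with a minimal sketch. It assumes a plain all-pole filter, a unit-pulse train as a stand-in for the Hilbert-transformer excitation, and white noise as a stand-in for the coloured noise; the function names and coefficient values are hypothetical and are not taken from the system described here.

```python
import random


def impulse_train(n_samples, period):
    """Stand-in voiced excitation: one unit pulse every `period` samples.

    Assumption: the real system uses the unit-sample response of a
    broad-band Hilbert transformer instead of a bare pulse.
    """
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Stand-in unvoiced excitation: uniform white noise in [-1, 1].

    Assumption: the real system uses coloured (spectrally shaped) noise.
    """
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def lpc_synthesize(excitation, a, gain=1.0):
    """Run the excitation through an all-pole LPC synthesis filter.

    Implements s[n] = gain * e[n] - sum_{k=1..p} a[k-1] * s[n-k],
    i.e. the filter 1 / A(z) with A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out
```

For example, `lpc_synthesize([1.0, 0.0, 0.0, 0.0], [-0.5])` yields the decaying response `[1.0, 0.5, 0.25, 0.125]`. In a concatenative system, the coefficient set `a` would be switched from one speech unit to the next while the excitation keeps running.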
The TTS-system involves:
The PC-based TTS system runs in real time with a sampling frequency of 8 kHz. The synthesis part of the system, which models the human vocal tract, is implemented on a signal processor board with a TMS320C25. The system is completed with a screen reader and can also be used together with a scanner for reading printed texts. Recently, a purely software implementation of the system has also been developed. In this implementation all calculations run in a PC equipped with a Sound Blaster card.
The naturalness of the synthetic speech depends to a great extent on the implementation of the prosodic features of the text. For this reason, the TTS system controls the word and sentence prosody not only by changes of the fundamental frequency in voiced units, but also by changes of the duration of speech units and by intensity variation. The TTS system allows the speech rate to be chosen, and the fundamental frequency of the male/female voice can also be set within a reasonable interval. In the application of the TTS system for the blind, an increase of the speech rate up to 300% of the basic rate was required. The basic rate is given by the speech rate in the diphone labelling of natural speech and corresponds to approximately 100 words/minute.
The changes of the fundamental frequency and of the duration of the speech units caused by the prosodic rules, together with the free choice of the fundamental frequency level and of the speech rate, led to the synchronization of the speech synthesis with the current fundamental frequency, in order to maintain the quality of the synthetic speech for different combinations of these parameters. This procedure has been called pitch synchronous linear predictive synthesis. The intelligibility and naturalness of the speech produced by the TTS system are good.
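The interplay of speech rate, fundamental frequency and frame length can be made concrete with a small arithmetic sketch. Only the 8 kHz sampling frequency, the basic rate of about 100 words/minute and the 300% upper limit come from the text; the function names, the example unit duration and the rule of one synthesis frame per pitch period are assumptions made for illustration, not the actual scheduling used in the system.

```python
SAMPLE_RATE_HZ = 8000  # sampling frequency stated in the text
BASE_RATE_WPM = 100    # approximate basic speech rate stated in the text


def words_per_minute(rate_percent):
    """Speech rate in words/minute for a rate given in % of the basic rate."""
    return BASE_RATE_WPM * rate_percent / 100.0


def pitch_period_samples(f0_hz, fs=SAMPLE_RATE_HZ):
    """Length of one pitch period in samples, rounded to whole samples."""
    return round(fs / f0_hz)


def scaled_duration_ms(base_ms, rate_percent):
    """Duration of a speech unit after speeding up to `rate_percent`."""
    return base_ms * 100.0 / rate_percent


def pitch_synchronous_frames(base_ms, rate_percent, f0_hz, fs=SAMPLE_RATE_HZ):
    """Split a voiced unit into frames that are each one pitch period long.

    Assumption: the number of frames is chosen so that the total length
    is as close as possible to the rate-scaled duration, with at least
    one frame per unit.
    """
    period = pitch_period_samples(f0_hz, fs)
    target_samples = scaled_duration_ms(base_ms, rate_percent) * fs / 1000.0
    n_frames = max(1, round(target_samples / period))
    return [period] * n_frames
```

Under these assumptions, a hypothetical 90 ms voiced unit spoken at 300% of the basic rate with a fundamental frequency of 100 Hz lasts 30 ms and is covered by three frames of 80 samples each. Because every frame starts at a pitch pulse, changing the rate or the pitch changes the number or the length of the frames without cutting a pitch period in the middle.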
Further improvement in quality could be achieved by extending the speech unit inventory, by using pole/zero modelling of speech production, and by defining better prosodic rules. The TTS synthesis project was part of the now completed action COST 233, Prosodics of Synthetic Speech, of the Commission of the European Communities.
Please contact:
Robert Vích - Institute of Radio Engineering and Electronics, Czech Academy of Sciences
Tel: +42 2 6881804
E-mail: vich@ure.cas.cz