Speech and Music Technology

Digital processing of speech is considered one of the most important areas of research and development in human language technologies and signal processing, and has been an active field for several decades. Key domains in speech processing include speech synthesis, speech recognition, and speaker identification and verification. The central role of speech in human interaction and the increasing interest in voice communication have stimulated intensive research, offering many new and significant applications. In recent years, ILSP has actively contributed to developing methods, systems, resources and tools in the aforementioned areas. In this context, ILSP continuously plans and adapts its research and development activities so as to respond effectively to this rapid technological evolution. A long-term aim is to continue designing and developing core technologies and prototypes with the intent of exploiting them in innovative applications and services, to develop resources and tools in these fields, and to participate actively in national and European research and development projects.

The goal of text-to-speech synthesis (TtS) is to automatically convert text content into natural, human-like speech. Unit-selection concatenative synthesis [Hunt & Black, 1996] is currently the dominant speech synthesis approach, as it still produces the highest synthetic speech quality in terms of naturalness and intelligibility. However, it does not lend itself well to expressive speech, nor is it speaker-independent; only a "neutral" speaking style is possible, or expressive speech of limited expressivity suitable for narrow application domains. Nevertheless, emerging approaches following a data-driven methodology offer new potential for raising the achieved quality even higher, as demonstrated in international scientific evaluations in the TtS field (e.g., the Blizzard Challenge, www.synsig.org).
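To make the unit-selection idea more concrete, the following is a minimal Python sketch of the kind of search described by Hunt & Black [1996]: each candidate unit is scored against the target specification (target cost) and against its neighbour at the join (concatenation cost), and a Viterbi-style dynamic programme selects the unit sequence with the lowest total cost. The feature vectors and cost functions below are illustrative placeholders, not those of any actual ILSP system.

import numpy as np

def target_cost(candidate, target):
    # Mismatch between a candidate unit and the target specification.
    return float(np.sum(np.abs(candidate - target)))

def join_cost(left, right):
    # Spectral/prosodic mismatch at the concatenation point of two units.
    return float(np.sum((left - right) ** 2))

def unit_selection(candidates, targets):
    # candidates: one array per target position, each row the feature vector of a unit.
    # targets: one target feature vector per position.
    # Returns the index of the selected unit at each position.
    n = len(targets)
    cost = [np.array([target_cost(u, targets[0]) for u in candidates[0]])]
    back = []
    for t in range(1, n):
        step = np.array([target_cost(u, targets[t]) for u in candidates[t]])
        joins = np.array([[join_cost(a, b) for b in candidates[t]]
                          for a in candidates[t - 1]])
        total = cost[-1][:, None] + joins          # cost of every (previous, current) pair
        back.append(np.argmin(total, axis=0))      # best predecessor for each current unit
        cost.append(np.min(total, axis=0) + step)
    path = [int(np.argmin(cost[-1]))]              # cheapest final unit, then trace back
    for t in range(n - 1, 0, -1):
        path.append(int(back[t - 1][path[-1]]))
    return path[::-1]

# Toy usage: three target positions, each with five candidate units of four features.
rng = np.random.default_rng(0)
targets = [rng.normal(size=4) for _ in range(3)]
candidates = [rng.normal(size=(5, 4)) for _ in range(3)]
print(unit_selection(candidates, targets))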

In addition, for the above reasons, there is growing interest in shifting to new paradigms employing parametric or hybrid approaches in TtS. Such approaches include recent statistical parametric methods [Zen et al., 2009] (for example, based on HMMs [Yoshimura et al., 1999; Karabetsos et al., 2008]), but also a resurgence of older approaches in the light of recent developments such as improved vocoding techniques or other parametric (or hybrid) speech models [Carlson & Gustafson, 2009; Öhlin & Carlson, 2004; Raptis et al., 2001; Acero, 1999]. Among the advantages of these methods is that the speech representation scheme they adopt allows for easier manipulation of speech without severe quality degradation. In this way, they offer an efficient framework for voice conversion (e.g. [Tachibana et al., 2008]), prosodic modelling, etc. Furthermore, the computational and storage requirements of the resulting TtS systems are significantly lower, making them more suitable for resource-constrained environments such as cell phones and portable devices. Finally, parametric models can offer a bridge between speech synthesis and speech recognition (e.g. [Eichner et al., 2000]). However, the speech quality of such systems still remains, in most cases, lower than that of unit-selection systems (e.g. [Karaiskos et al., 2008]). This is partly due to known issues in parametric TtS that are the subject of research at the international level, such as the side-effects of vocoding, the adequacy of the acoustic modelling, and the over-smoothing of parameter trajectories as a result of their statistical modelling [Zen et al., 2009]. In addition, the parametric TtS framework is particularly suitable for modelling not only prosodic and acoustic speech parameters but also parameters that relate to facial expression and body movement. Thus, it can be naturally extended to the area of multimodal (or audiovisual) speech synthesis, so as to drive not only acoustic models but also graphical models of synthesis [Carlson & Granström, 2005].
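To illustrate the over-smoothing issue mentioned above, the following deliberately simplified Python sketch mimics the statistical parametric idea: each HMM state holds a Gaussian over a couple of acoustic parameters, and a trajectory is generated from the state sequence. Generating only the per-state means yields flat, over-smoothed trajectories; real systems mitigate this with dynamic (delta) features and maximum-likelihood parameter generation. All dimensions, means and durations below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Three states, each with a mean vector and diagonal variance over two acoustic dimensions.
state_means = np.array([[1.0, 0.2], [0.5, -0.1], [1.5, 0.4]])
state_vars  = np.array([[0.05, 0.02], [0.05, 0.02], [0.05, 0.02]])

# State durations in frames; in a real system these come from an explicit duration model.
durations = [4, 6, 5]

# Over-smoothed trajectory: simply repeat each state mean for its duration.
mean_trajectory = np.concatenate(
    [np.tile(state_means[s], (d, 1)) for s, d in enumerate(durations)]
)

# Sampling from the state Gaussians restores variance, but still not realistic dynamics.
sampled_trajectory = np.concatenate(
    [rng.normal(state_means[s], np.sqrt(state_vars[s]), size=(d, 2))
     for s, d in enumerate(durations)]
)

print(mean_trajectory.shape, sampled_trajectory.shape)  # (15, 2) (15, 2)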

Further to the above, the next scientific milestone in TtS is so-called expressive speech synthesis. Expressiveness, affect and emotion are expected to fuel tomorrow's natural speech interfaces and dialogue systems; hence, the perception and generation of affective/emotional speech patterns by computers is becoming a central issue. Expressive speech synthesis is a multidisciplinary research area that addresses some of the most complex problems in speech and language processing [Campbell, 2006]. It would not only contribute significantly towards the development of the next generation of speech synthesis systems and of speech emotion recognition research, but would also provide insight into what constitutes expressiveness in speech and what stylistic patterns it employs. Expressive speech synthesis is currently a focus of the international scientific community, independently of the underlying technology (www.synsig.org).

The term automatic speech recognition (ASR) covers a broad spectrum of research topics and applications, from a typical large-vocabulary, speaker-independent dictation system to a more demanding automatic subtitling system. The variety of applications and their social impact has led one of ILSP's research groups to focus on the subject for the last ten to fifteen years. At present, state-of-the-art ASR systems tend to fulfil three important characteristics: (a) robustness, (b) portability and (c) adaptability. The performance of a robust system does not degrade catastrophically as operating conditions change from those the system was trained under. Portability ensures that performance does not change as we move from one task to another. Finally, adaptability guarantees the continuous adaptation of the system to changing conditions.

Current state-of-the-art technology in speech recognition uses hidden Markov models (HMMs) combined with Mel-frequency cepstral coefficients (MFCCs) and the Viterbi algorithm to enable fast search among the numerous hypotheses. The optimal handling of these subsystems, in conjunction with the quality of the resources used to design the lexicon and the acoustic and language models, leads to some of the most efficient speech recognizers.
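As an illustration of the decoding step, the Python sketch below implements a log-domain Viterbi search over an HMM. In a real recognizer the per-frame emission scores would come from acoustic models (e.g. Gaussian mixtures) evaluated on MFCC vectors; here they are random placeholders, as are the toy transition and initial probabilities.

import numpy as np

def viterbi(log_init, log_trans, log_emit):
    # log_init: (S,) initial state log-probabilities
    # log_trans: (S, S) state transition log-probabilities
    # log_emit: (T, S) per-frame emission log-likelihoods (from MFCC acoustic models)
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (previous state, next state)
        back[t] = np.argmax(scores, axis=0)       # best predecessor per next state
        delta = np.max(scores, axis=0) + log_emit[t]
    # Trace back the most likely state sequence.
    states = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t][states[-1]]))
    return states[::-1]

# Toy example: 3 states, 5 frames of placeholder emission log-likelihoods.
rng = np.random.default_rng(1)
log_emit = rng.normal(size=(5, 3))
log_trans = np.log(np.full((3, 3), 1.0 / 3))
log_init = np.log(np.full(3, 1.0 / 3))
print(viterbi(log_init, log_trans, log_emit))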

Speaker identification/verification research in recent years utilizes a novel, speaker-oriented representation of speech that maps each speaker utterance onto a fixed-length feature space, called the i-vector (Senoussaoui et al., 2010). This feature space is combined with the dimensionality reduction method of probabilistic Linear Discriminant Analysis (pLDA), which originates from the similar problem of face recognition in computer vision (Prince & Elder, 2007). The pLDA treatment of i-vectors (Kenny et al., 2013) offers an elegant distinction between speaker variability and channel variability and hence eliminates the need for channel-based likelihood score-normalization techniques. Furthermore, the state-of-the-art verification stage outperforms the baseline GMM-UBM adaptation and scoring paradigm (Stafylakis et al., 2010). The next generation of verification systems will be based on a fully Bayesian statistical framework using the Variational Bayes approach to inference, which allows a tuning-free approach to verification (Kenny, 2010).
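As a simplified illustration of i-vector-based verification, the Python sketch below length-normalizes two i-vectors and compares them with a cosine score, a common baseline; the pLDA scoring referred to above instead computes a likelihood ratio under explicit between- and within-speaker models. The vector dimensionality, the random i-vectors and the decision threshold are placeholders.

import numpy as np

def length_normalize(x):
    # Project an i-vector onto the unit sphere (a standard preprocessing step).
    return x / np.linalg.norm(x)

def cosine_score(enrol_ivec, test_ivec):
    # Cosine similarity between the enrolment and test i-vectors.
    return float(length_normalize(enrol_ivec) @ length_normalize(test_ivec))

rng = np.random.default_rng(2)
enrol = rng.normal(size=400)   # i-vector extracted from enrolment speech (placeholder)
test = rng.normal(size=400)    # i-vector extracted from the test utterance (placeholder)

score = cosine_score(enrol, test)
decision = "accept" if score > 0.5 else "reject"   # threshold purely illustrative
print(score, decision)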
