Summary

The evoc-learn project uses computational modelling to address the mystery of how children learn to speak. We will test the hypothesis that children learn by using their own articulators as a learning device, experimenting with different vocal manoeuvres until they sound sufficiently like adults. We will use a state-of-the-art articulatory synthesizer to emulate all the critical elements of the learning process, with the goal of generating highly accurate and natural-sounding speech from the learned articulatory parameters. This will be done in five subprojects that assess the role and effectiveness of each hypothesized mechanism in the learning process.

Context

One of the greatest mysteries of human language is how infants learn to speak without being explicitly taught. They are unlikely to be coached by caregivers or older siblings, as any instructions would be undetailed, infrequent and, critically, incomprehensible. Nor can they learn by watching other people talk, as most of the speech articulators are hidden from view. The only guaranteed input for infants is the sound of speech. It has therefore been hypothesized that children learn to speak by mimicking the speech sounds they hear. But this vocal mimicry hypothesis has been met with scepticism. A major objection is that the large differences between children’s articulator dimensions and those of adults would make it impossible for children to compare their own mimicry with the target utterance.

This speaker normalization problem has not been solved by behavioural or neural studies alone, as it is hard to identify the exact underlying mechanisms of vocal learning from direct observations. What is needed are computational models that can simulate the step-by-step progression of vocal learning. Research in this direction by other groups, however, has so far only generated synthetic utterances that sound unnatural and have low intelligibility. Our own preliminary research has discovered a method, called auditory-guided articulatory-based vocal learning (AAVL), that can solve both the speaker normalization and naturalness problems. The central idea is not to attempt any explicit speaker normalization, but to process speech signals in as much detail as possible and to simulate articulatory mechanisms as authentically as possible.

The improvement in acoustic processing comes from the use of Mel-frequency cepstral coefficients (MFCCs), a representation that has proven highly successful in speech technology. The improvement in simulating articulatory mechanisms comes from using a high-quality articulatory synthesizer as the speech generator, and from controlling the synthesizer with the target approximation model, a model of articulatory dynamics developed in our previous research. Vowels synthesized with the learned articulatory parameters reached the same level of perceptual accuracy and naturalness as human-produced vowels, and a child vocal tract could be trained on adult-produced syllables just as easily, even though no speaker normalization was performed.
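To make these two ingredients concrete, the sketch below pairs a third-order critically damped response (following the standard formulation of the quantitative target approximation model) with an MFCC-based distance, and uses them in a simple random-search mimicry loop. The articulatory synthesizer call `synthesize(params, sr)` is a hypothetical placeholder, and all parameter names are illustrative rather than the project's actual interface; in the full model, the learned parameters would be articulatory targets that `target_approximation` converts into trajectories before synthesis. The MFCC computation uses the librosa library.

```python
import numpy as np
import librosa

def target_approximation(t, m, b, lam, y0, v0=0.0, a0=0.0):
    """Third-order critically damped response approaching the target m*t + b.

    t          : time axis in seconds, starting at 0
    m, b       : slope and height of the linear articulatory target
    lam        : rate of target approximation (larger = faster convergence)
    y0, v0, a0 : initial position, velocity and acceleration
    """
    c1 = y0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
    return (m * t + b) + (c1 + c2 * t + c3 * t ** 2) * np.exp(-lam * t)

def mfcc_distance(wave_a, wave_b, sr=16000, n_mfcc=13):
    """Mean frame-wise Euclidean distance between two MFCC sequences."""
    mf_a = librosa.feature.mfcc(y=wave_a, sr=sr, n_mfcc=n_mfcc)
    mf_b = librosa.feature.mfcc(y=wave_b, sr=sr, n_mfcc=n_mfcc)
    n = min(mf_a.shape[1], mf_b.shape[1])  # crude length alignment
    return float(np.mean(np.linalg.norm(mf_a[:, :n] - mf_b[:, :n], axis=0)))

def mimic(adult_wave, synthesize, init_params, sr=16000, iters=500,
          step=0.05, seed=0):
    """Random-search mimicry: perturb articulatory parameters and keep any
    perturbation that brings the synthetic output closer to the adult target.
    No speaker normalization is applied at any point."""
    rng = np.random.default_rng(seed)
    params = np.asarray(init_params, dtype=float)
    best = mfcc_distance(synthesize(params, sr), adult_wave, sr)
    for _ in range(iters):
        candidate = params + rng.normal(0.0, step, size=params.shape)
        dist = mfcc_distance(synthesize(candidate, sr), adult_wave, sr)
        if dist < best:
            params, best = candidate, dist
    return params, best
```

Note that nothing in this loop normalizes for speaker: the child's synthetic output and the adult target are compared directly in MFCC space, which is exactly the property the AAVL hypothesis relies on.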

Objectives

The ultimate goal of our research is to use computational modelling to reveal how children learn to speak without being explicitly taught. We will test the possibility that learning is achieved through repeated mimicry without explicit speaker normalization. The main hypothesis of the project is that both the effectiveness and the explanatory power of modelling simulations are contingent on whether all the critical aspects of vocal learning are authentically emulated. Specifically, we will test whether the following mechanisms are critical for vocal learning:

  1. Synchronized onset of consonants and vowels at the start of the syllable is the core organizing mechanism of the syllable, and is vital for vocal learning
  2. Visual input is a secondary training signal, critical for sounds that involve visible articulators such as the lips
  3. Corrective auditory feedback is a late-developing learning mechanism that helps to accelerate vocal learning as the child grows older
  4. Perceptually learned auditory templates can be used as training signals, instead of live speech, for vocal learning
  5. Babbling is a mechanism for:
    • a) discovering the syllable frame as a means of synchronizing multiple articulators, and
    • b) proactively exploring the mapping between vocal tract configurations and acoustic patterns (see the sketch after this list)
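
As a concrete illustration of mechanism 5b, the sketch below treats babbling as random exploration of the motor-to-sound mapping: each babbled configuration and its acoustic outcome are stored, and a nearest-neighbour lookup then serves as a crude inverse model. It reuses the hypothetical `synthesize` placeholder from the earlier sketch; all names are illustrative.

```python
import numpy as np
import librosa

def babble(synthesize, n_params, n_trials=2000, sr=16000, seed=0):
    """Collect (motor parameters, MFCC summary) pairs by random babbling."""
    rng = np.random.default_rng(seed)
    memory = []
    for _ in range(n_trials):
        params = rng.uniform(-1.0, 1.0, size=n_params)  # normalized motor space
        wave = synthesize(params, sr)                   # hypothetical synthesizer
        mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13).mean(axis=1)
        memory.append((params, mfcc))
    return memory

def inverse_lookup(adult_wave, memory, sr=16000):
    """Return the babbled configuration whose acoustics best match a heard sound."""
    target = librosa.feature.mfcc(y=adult_wave, sr=sr, n_mfcc=13).mean(axis=1)
    distances = [np.linalg.norm(mfcc - target) for _, mfcc in memory]
    return memory[int(np.argmin(distances))][0]
```

A babbled memory of this kind could then seed the mimicry loop with a far better starting configuration than a random guess, which is one way the exploratory role of babbling could accelerate learning.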

Significance and originality

Developing a coherent understanding of the basic mechanisms of human vocal learning is long overdue. Speech is acquired during childhood, and this is what allows the unique human ability to communicate complex ideas to be passed on across generations. Yet it is still unclear how exactly this acquisition is accomplished. The earliest stage of acquisition is the most baffling, as infants at that age can neither understand instructions nor ask questions.

Computational simulation offers a means to identify the specific steps and conditions needed for successful vocal learning. Our modelling aims to simulate vocal learning to the point where a trained articulatory synthesizer can generate syllables that are both intelligible and natural-sounding. This has never been achieved by other research groups, but it has been partially demonstrated by our preliminary results. The success of this project will therefore critically enhance our knowledge of vocal learning.

Also, by removing or weakening various aspects of a successful simulated learning process, we can identify likely sources of specific deficits in various speech and developmental disorders. Moreover, once natural-sounding speech can be generated in the simulation, significant insights would be gained into long-standing theoretical issues such as coarticulation, syllable formation and motor equivalence. An effective simulation of speech as a skilled motor movement may also have implications for motor control and motor learning in general. Finally, a full simulation that can generate natural-sounding speech may have implications for speech technology, robotics and artificial intelligence.

This project is highly original in that it is the first to emulate all the critical aspects of the vocal learning process as faithfully as possible. Also, in a major departure from the common practice of demonstrating only incremental improvements over a low-performance baseline, all simulations will target human-level performance in terms of both intelligibility and naturalness.

Methodology

The basic methodology is to comprehensively test auditory-guided articulatory-based vocal learning (AAVL), the method found to be effective in our preliminary research, and to extend it in a number of new directions. The testing will be done in five subprojects, each evaluating one or more of the hypothesized articulatory or learning mechanisms.
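
As an illustration of how the subprojects could share a common evaluation harness, the sketch below runs the full learner with one hypothesized mechanism disabled at a time, in the spirit of the ablations described under Significance. `run_aavl`, the mechanism flags and the scoring convention are all hypothetical placeholders, not the project's actual interface.

```python
# Hypothetical mechanism flags mirroring the five objectives above.
MECHANISMS = [
    "cv_synchronized_onset",  # 1: synchronized consonant-vowel onset
    "visual_input",           # 2: visual training signal
    "auditory_feedback",      # 3: corrective auditory feedback
    "auditory_templates",     # 4: learned templates as training signals
    "babbling",               # 5: exploratory babbling
]

def ablation_study(run_aavl, corpus):
    """Score the full model, then each single-mechanism ablation.

    `run_aavl(corpus, disabled)` is a hypothetical entry point that trains
    the learner with the listed mechanisms switched off and returns a score
    such as the MFCC distance to human benchmarks (lower is better).
    """
    scores = {"full": run_aavl(corpus, disabled=())}
    for mechanism in MECHANISMS:
        scores[f"without_{mechanism}"] = run_aavl(corpus, disabled=(mechanism,))
    return scores
```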