Lip Sync Intermediate

Lip synchronization brings avatars to life by matching mouth movements to speech. In this lesson, you will implement audio-driven lip sync using visemes and blend shapes, creating realistic facial animation that works with both pre-recorded audio and real-time speech.

Understanding Visemes

Visemes are the visual equivalent of phonemes — they represent the mouth shapes needed to produce specific speech sounds. The Oculus viseme set, which Ready Player Me supports, defines 15 standard mouth positions that cover all English speech sounds.

Viseme   Phoneme    Example
sil      (silence)  Mouth closed
aa       /a/        "father"
ih       /i/        "bit"
oh       /o/        "go"
ou       /u/        "boot"
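Before driving any animation, it helps to resolve each viseme name to its blend shape index once at startup. The sketch below assumes the avatar exposes the Oculus viseme set as blend shapes with a "viseme_" prefix (e.g. "viseme_aa"); verify the exact names against your mesh, as naming varies by avatar pipeline.

C#
using System.Collections.Generic;
using UnityEngine;

public class VisemeIndexCache : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceMesh;

    // The 15 Oculus visemes, in their standard order
    private static readonly string[] VisemeNames =
    {
        "sil", "PP", "FF", "TH", "DD", "kk", "CH",
        "SS", "nn", "RR", "aa", "E", "ih", "oh", "ou"
    };

    private readonly Dictionary<string, int> visemeIndices =
        new Dictionary<string, int>();

    private void Awake()
    {
        foreach (string viseme in VisemeNames)
        {
            // "viseme_" prefix is an assumption; check your avatar's blend shape list
            int index = faceMesh.sharedMesh.GetBlendShapeIndex("viseme_" + viseme);
            if (index >= 0)
                visemeIndices[viseme] = index;
        }
    }
}

GetBlendShapeIndex returns -1 for names the mesh does not define, so missing visemes are simply skipped rather than causing errors later.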

Blend Shape Animation

Ready Player Me avatars include blend shapes (morph targets) on the face mesh. Each viseme maps to a blend shape index on the SkinnedMeshRenderer. To animate lip sync, you set blend shape weights based on audio analysis.

C#
using UnityEngine;

public class LipSyncController : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceMesh;
    private AudioSource audioSource;
    private int jawOpenIndex;
    private float[] samples = new float[256];

    private void Awake()
    {
        audioSource = GetComponent<AudioSource>();
        // Blend shape name depends on the avatar; "jawOpen" is common but verify yours
        jawOpenIndex = faceMesh.sharedMesh.GetBlendShapeIndex("jawOpen");
    }

    private void Update()
    {
        // Read the raw waveform of the currently playing audio
        audioSource.GetOutputData(samples, 0);
        float volume = GetRMSVolume();

        // Map volume to the jaw-open blend shape, smoothed over time
        faceMesh.SetBlendShapeWeight(jawOpenIndex,
            Mathf.Lerp(faceMesh.GetBlendShapeWeight(jawOpenIndex),
                Mathf.Clamp01(volume * 10f) * 100f, Time.deltaTime * 12f));
    }

    private float GetRMSVolume()
    {
        // Root mean square of the sample buffer approximates perceived loudness
        float sum = 0f;
        foreach (float s in samples) sum += s * s;
        return Mathf.Sqrt(sum / samples.Length);
    }
}

Real-Time vs. Pre-Baked

Choose Your Approach: For pre-recorded dialogue, use tools like the Oculus Lipsync SDK or Rhubarb Lip Sync to pre-bake viseme timing data. For real-time voice chat or TTS, use audio analysis to drive blend shapes on the fly.
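A pre-baked pipeline can be sketched like this: the baking tool emits a list of timed viseme cues, which you replay against the audio clock at runtime. The VisemeCue struct below is illustrative, not part of any SDK; a real exporter would also include per-cue weights and you would smooth transitions rather than snapping.

C#
using System.Collections.Generic;
using UnityEngine;

[System.Serializable]
public struct VisemeCue
{
    public float time;            // seconds from clip start
    public int blendShapeIndex;   // resolved once at import time
}

public class BakedVisemePlayer : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceMesh;
    [SerializeField] private AudioSource audioSource;
    [SerializeField] private List<VisemeCue> cues;   // sorted by time

    private int cursor;
    private int activeIndex = -1;

    private void Update()
    {
        float t = audioSource.time;

        // Advance through any cues whose timestamp has passed
        while (cursor < cues.Count && cues[cursor].time <= t)
        {
            if (activeIndex >= 0)
                faceMesh.SetBlendShapeWeight(activeIndex, 0f);

            activeIndex = cues[cursor].blendShapeIndex;
            faceMesh.SetBlendShapeWeight(activeIndex, 100f);
            cursor++;
        }
    }
}

Because the cues are keyed to AudioSource.time rather than wall-clock time, playback stays in sync even if the frame rate stutters or the clip is paused.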

Facial Expressions

Beyond lip sync, combine blend shapes for eyebrow raises, eye squints, and smile/frown to convey emotion. Layer these on top of lip sync for truly expressive avatars. Use Animator parameters or a custom expression system to blend between emotional states.
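One way to layer an expression on top of lip sync is to drive a separate set of blend shapes from a normalized emotion parameter in LateUpdate, after the lip sync controller has run. The "mouthSmile" and "browInnerUp" names below are illustrative assumptions; substitute the blend shape names your avatar actually defines.

C#
using UnityEngine;

public class ExpressionLayer : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceMesh;
    [Range(0f, 1f)] public float happiness;

    private int smileIndex;
    private int browIndex;

    private void Awake()
    {
        // Names vary by avatar; verify against your mesh's blend shape list
        smileIndex = faceMesh.sharedMesh.GetBlendShapeIndex("mouthSmile");
        browIndex = faceMesh.sharedMesh.GetBlendShapeIndex("browInnerUp");
    }

    private void LateUpdate()
    {
        // Runs after lip sync's Update; the two layers touch different shapes,
        // so jaw/viseme motion and emotion combine without fighting
        if (smileIndex >= 0)
            faceMesh.SetBlendShapeWeight(smileIndex, happiness * 100f);
        if (browIndex >= 0)
            faceMesh.SetBlendShapeWeight(browIndex, happiness * 30f);
    }
}

Keeping expression shapes distinct from viseme shapes is the key design choice here: both systems write weights every frame, and overlap would cause one to overwrite the other.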