Advanced Lip Sync Theory

Learn how to further assess sync and natural mouth movements.

Visemes & Phonemes Overview

  • Phoneme:

    • The smallest unit of sound in a given language that can distinguish one word from another. It cannot be broken down into smaller linear units.

However, a more practical description can be found in the excerpt below from Human Motor Control, 2nd Edition:

When we speak, we produce words. These consist of phonemes (roughly, vowels and consonants), which allow for meaning distinctions in a language. /l/ and /r/ are phonemes in English, as shown by the fact that “lip” and “rip” have different meanings. In Mandarin Chinese, however, word meanings are never distinguished by /l/ or /r/. Thus, /l/ and /r/ are not phonemes in Mandarin Chinese. On the other hand, speakers of Mandarin Chinese distinguish between [l] and [r]. They say [l] only at word beginnings and [r] only at word endings. [l] and [r] are therefore said to be two phones of Mandarin Chinese. Phones are sound categories that can but do not necessarily convey meaning. In English, [l] and [r] are distinct phones that also happen to be phonemes. In Mandarin Chinese, [l] and [r] are phones that happen not to be phonemes. Notice that phonemes are denoted here with slashes (/ /) while phones are denoted with brackets ([ ]).
  • Viseme:

    • A group of phonemes that are visually indistinct from one another. In other words, a “visual phoneme”: a group of phonemes that look the same (e.g. same lip position, jaw position, etc.) when produced. In the Consonants and Vowels section, you will find various phonemes listed together as a single viseme group. These groupings are based on the phonemes’ visual similarities.

Consonants Overview

  • consonant: a speech sound produced by some form of airflow restriction

    • This restriction can be created by the lips, teeth, and/or tongue. M, b, and p, for example, require the lips to meet (a form of airflow restriction) in order to be produced. These sounds are referred to as bilabials or, more specifically, full closure bilabials. F and v, on the other hand, require lip-to-tooth contact (another form of airflow restriction) in order to produce enough friction to create the target sound. We will cover the remaining forms of restriction as we go through each group.

Consonant viseme groups we will cover include: 

m, b, & p

w & r

f & v

ch, sh, dge, & ʒ 

s & z

th

n & l

t & d
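The groups listed above can be treated as a simple lookup table. The sketch below is illustrative only: the names VISEME_GROUPS and viseme_for are hypothetical, and the phonemes use this article's informal spellings (with zh standing in for ʒ) rather than a formal phoneme set.

```python
from typing import Optional

# Hypothetical table: viseme group -> informal phoneme spellings from this article.
VISEME_GROUPS = {
    "m/b/p": ["m", "b", "p"],
    "w/r": ["w", "r"],
    "f/v": ["f", "v"],
    "ch/sh/dge/zh": ["ch", "sh", "dge", "zh"],
    "s/z": ["s", "z"],
    "th": ["th"],
    "n/l": ["n", "l"],
    "t/d": ["t", "d"],
}

# Invert the table so each phoneme points at its viseme group.
PHONEME_TO_VISEME = {
    phoneme: group
    for group, phonemes in VISEME_GROUPS.items()
    for phoneme in phonemes
}


def viseme_for(phoneme: str) -> Optional[str]:
    """Return the viseme group for a phoneme, or None if it is not listed."""
    return PHONEME_TO_VISEME.get(phoneme.lower())
```

Phonemes that are not in the list, such as k, g, and h, return None here.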

Below is a diagram to illustrate which articulators (vocal organs like the teeth, tongue, lips, throat, etc.) each consonant viseme is primarily driven by. 

[Diagram: consonant viseme groups arranged by primary articulator (lips, teeth, tongue)]

 

NOTE 1: We are restricting our analysis to the articulators of the teeth, tongue, and lips because we are focusing on visibility. Due to visibility restrictions (i.e. because we cannot see them), other articulators such as the larynx or inner mouth areas like the hard or soft palates are not a priority concern at this time. Because of their lack of visibility, h and k/g are missing from the chart. H’s airflow restriction occurs in the throat, and k/g’s restriction occurs when the back of the tongue blocks the throat. 

NOTE 2: Due to their indistinguishable outside appearance, w/r and oo are grouped as one viseme despite oo’s vowel status.

What consonants should we prioritize - and why? 

  • priority visemes:  viseme groups made of phonemes that have reliable properties to assess.

  • stop: a speech sound created by completely blocking the flow of air and then releasing it.

  • fricative:  refers to speech sounds created by air escaping from the mouth through a narrow passageway.

  • affricate:  a sound that starts as a stop and then releases as a fricative.

  • bilabial: a speech sound that requires both lips; in linguistics, this covers both w and m/b/p; however, at Flawless, when we refer to bilabials we are exclusively referring to m/b/p.

When evaluating speech, there are many sounds and moving parts to observe. To further complicate things, each sound is subject to variation caused by factors such as (but not limited to):

  • preceding and subsequent sounds (coarticulation) 

  • individual speaking style

  • state of alertness

  • volume level

  • emotional context

  • speed of speech

  • general randomness (e.g. the same person speaking in the same style, emotional context, volume level, state of alertness is unlikely to produce an [i] in the exact same manner each time)

Because going through every fraction of a second of speech across varying contexts would be both overwhelming and inefficient, our lip sync evaluation process focuses on assessing priority visemes only.

Priority visemes refer to viseme groups made of phonemes that have more robust properties. These viseme groups are less subject to context-based variation, making them more reliable to assess.

More Distinguished

  • m/b/p

  • w/r & oo

    • NOTE: The distinguished r’s tend to occur at the beginning of a word/syllable; less distinguished r’s tend to occur at the end of a word or syllable - e.g. rather vs. father. For rounded, distinguished r’s, we want to focus on the r’s at the beginning of a word/syllable.

  • f/v

Distinguished-ish

  • s/z

    • NOTE: s/z is not a distinguished lip shape, but it has robust restrictions regarding how close the upper and lower teeth need to be in order to produce it legibly.

  • ch/sh/dge/zh (tʃ/ʃ/dʒ/ʒ)

  • th

    • NOTE: th is not a distinguished lip shape - but has a prominent tongue presence.

  • n/l

    • NOTE 1: n/l is mostly defined by tongue position, but the tongue is not always visible.

    • NOTE 2: There are two different types of l - light l and dark l. Light l’s occur BEFORE a vowel sound in a syllable; they have more tongue presence, and the tongue may stick out past the teeth (which can resemble a “th”). Dark l’s occur after vowels and do not pass the teeth.

Less Distinguished

  • t/d

  • k/g

  • h

Priority Viseme Groups

Priority Group: Level 1

m/b/p

  • This viseme group consists of phonemes that require the upper and lower lips to be fully closed.

  • We typically refer to these phonemes as bilabials. Bilabials are types of sounds that require both lips. The lips must interact together to create a full closure or near closure. m/b/p requires a full closure, whereas w is an example of a bilabial that requires partial closure.

w/r/oo

  • This viseme group consists of phonemes that require the upper and lower lips to be rounded and nearly closed.

  • Though at Flawless a “bilabial” almost exclusively refers to an m/b/p, w’s are technically bilabials as well; they are simply bilabials with partial closure rather than full closure.

  • Though /r/ and oo are included in this group as well, neither /r/ nor oo (IPA symbol /uː/) is considered a bilabial, despite having lip configurations indistinguishable from /w/. Because oo is a vowel, it is instead classified as a tense, rounded vowel.

  • It is important to note that in this viseme group, we are only interested in the /r/ sound when it occurs at the beginning of a word or syllable. When /r/ occurs at the beginning of a word or syllable, it takes on a strong rounded shape. For example, in the word “red,” the /r/ position is strongly rounded; however, when /r/ occurs at the end of a word or syllable, as it does in the word “father,” it occurs in a much looser form and is significantly more subject to context-based variation. 

  • As noted, this group contains a vowel, oo (formal IPA symbol /uː/). I have included the oo with the w’s and r’s due to its indistinguishable outside appearance. 
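The positional rule for /r/ described above can be sketched as a small check. This is a minimal illustration under stated assumptions: syllables are represented as plain lists of informally spelled phonemes, the function name is_priority_r is hypothetical, and an r that is neither first nor last in its syllable (as in a consonant cluster) is not addressed by this article, so the sketch conservatively excludes it.

```python
from typing import List


def is_priority_r(syllable: List[str], index: int) -> bool:
    """Return True only for an r that opens its syllable.

    Assumption: only a syllable-initial r counts as the strongly rounded
    w/r/oo viseme; every other position is treated as not assessable.
    """
    return syllable[index] == "r" and index == 0


red = ["r", "e", "d"]                      # "red": r begins the syllable
father_second_syllable = ["th", "e", "r"]  # "father": r ends the syllable

print(is_priority_r(red, 0))                     # True
print(is_priority_r(father_second_syllable, 2))  # False
```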

f/v

  • This viseme group consists of phonemes that require the lower lip to interact with the teeth to create restricted airflow. 

  • /f/ and /v/ are both labiodental, fricative sounds. Labiodental refers to sounds created via lip-to-tooth interaction, and fricative refers to speech sounds created by air escaping from the mouth through a narrow passageway.

  • Though it may be possible to produce a similar sound in the reverse, i.e. with the upper lip and lower teeth, you can expect to never see such a variation.

NOTE: The “fully closed” nature of m/b/p is not as absolute as it may seem. There are many cases in natural speech when we can produce an intelligible m/b/p without fully closing the lips. These almost-closed configurations can occur in a variety of circumstances but are most notable when:

  • someone is smiling while speaking (because the lips are more separated, making it take more energy to fully close the lips)

  • a /p/ is followed immediately by an /f/ - e.g. in words like “helpful,” “hopeful,” “stepfather,” etc. (because the consonants blend together to create a hybrid between /p/ and /f/ to maximize efficiency)

  • someone is slurring their speech or not enunciating (Happens more than you might think!)
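The three circumstances above could be encoded as a rough check for when an incomplete m/b/p closure should not automatically count against sync. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the smiling and slurred flags would have to come from a reviewer or some other annotation, and the list of circumstances is not exhaustive.

```python
from typing import Optional

FULL_CLOSURE_BILABIALS = {"m", "b", "p"}


def closure_may_be_partial(
    phoneme: str,
    next_phoneme: Optional[str] = None,
    smiling: bool = False,
    slurred: bool = False,
) -> bool:
    """Return True if an almost-closed m/b/p is plausible in this context."""
    if phoneme not in FULL_CLOSURE_BILABIALS:
        return False  # the note only concerns m/b/p
    # /p/ directly followed by /f/, as in "helpful" or "stepfather"
    p_before_f = phoneme == "p" and next_phoneme == "f"
    return smiling or slurred or p_before_f
```

A False result only means none of the listed circumstances applies; it does not prove that a full closure must be visible.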

Priority Group: Level 2

s/z

  • Like f/v, the s/z viseme group is also made up of fricatives. s/z belong to a subset of fricatives known as sibilants. Sibilants are created by forcing air through a narrow channel while also curling the tongue to direct air over the edge of the teeth.

  • Though s/z does not have a distinct lip shape or position (beyond the requirement that the lips not be closed), its fricative nature makes it semi-distinguished: it requires the upper and lower teeth to remain in close proximity.

ch/sh/dge/zh, designated in IPA as tʃ/ʃ/dʒ/ʒ

  • This viseme group consists of a mixture of phonemes that are considered affricates and fricatives. sh & zh are considered fricatives; whereas ch & dge are considered affricates. An affricate is a sound that starts as a stop and then releases as a fricative; this is why ch and dge’s IPA designation has two symbols: tʃ⁠⁠ for ch and dʒ for dge. In the former, the t represents the t-like stop. In the latter, the d represents the d-like stop. For example, when we say the word “reach,” during the “ch” part, we make a t-sound and then transition into the sh-style fricative. In the word “judge,” during the “dge” part, we start with a d-like stop and end in a zh-style fricative.

  • fricatives:

    • sh / ʃ⁠⁠

    • zh / ʒ

  • affricates: 

    • ch / tʃ⁠⁠

    • dge / dʒ
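The two-symbol IPA notation for affricates maps directly onto the stop-then-fricative description above. The sketch below is illustrative: the names MANNER and decompose are hypothetical, and the table covers only the four phonemes in this viseme group.

```python
from typing import Optional, Tuple

# Manner of articulation for the four phonemes in this viseme group.
MANNER = {
    "ʃ": "fricative",   # sh
    "ʒ": "fricative",   # zh
    "tʃ": "affricate",  # ch
    "dʒ": "affricate",  # dge
}


def decompose(symbol: str) -> Optional[Tuple[str, str]]:
    """Split an affricate into (stop, fricative); return None otherwise."""
    if MANNER.get(symbol) != "affricate":
        return None
    # The first character is the stop, the remainder is the fricative release.
    return symbol[0], symbol[1:]
```

For example, decompose("tʃ") yields the t-like stop and the sh-style fricative heard at the end of “reach.”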

th

  • This viseme group consists of two different th sounds: voiced th (ð) and unvoiced th (θ). Both th’s are referred to as interdental fricatives; interdental fricatives are consonants that are created by placing the tongue between the teeth.

  • Due to the nature of fricatives requiring narrow air passageways, in order to produce the th group, the teeth must remain relatively close together and to the tongue. The more space there is between the teeth, the more the tongue needs to compensate for the increased openings. 

n/l

  • This viseme group consists of sounds that require the tongue to be positioned upward, pressing against the area behind the front teeth. The tongue positions for n and l differ slightly, but the difference is not visually distinguishable due to occlusion from the teeth. Their main difference in sound comes from the direction of airflow: for n, the tongue blocks the air so it exits through the nose, and for l, the air goes around the sides of the tongue.

Priority Group: Level 3

t/d

  • This viseme group consists of two stop consonants that share the same lip, tongue, and jaw positions but differ in that /t/ is voiceless and /d/ is voiced.

  • A stop consonant is a speech sound created by completely blocking the flow of air and then releasing it.

  • Compared to visemes in Priority Groups 1 and 2, t/d does not possess any highly distinguishing properties. This group does not offer visual contrast from other speech sounds and only requires the lips and jaw to be at least slightly parted.

  • In most cases, t/d is undetectable (not able to be discerned from visual inspection). Only in cases when the lips and jaw are open enough to clearly see the tongue can t and d be distinguished from other phonemes. When these conditions are present, the tongue must lift up to tap the area behind the top teeth.

k/g

  • Even harder to detect than t/d, this viseme group consists of phonemes that experienced lip readers call invisible. Though k/g and t/d both only require the lips and jaw to be at least slightly open, k/g does not have a visible tongue position. Instead, its defining feature is a stop in the back of the throat, which is not observable under general conditions.

h

  • Almost as imperceptible as k/g is the h viseme. h does not require any particular lip position, but it does require moderate jaw opening. To produce an h sound, air must pass in a constricted manner between the tongue and the roof of the mouth, or at the back of the throat. h is unvoiced.
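As a recap, the grouping described in this article can be sketched as a filter that keeps only the viseme groups worth assessing. The names below are hypothetical. Levels 1 and 2 follow the priority groups above; level 3 is an assumed label for the less distinguished groups (t/d, k/g, h), and the default cut-off of 2 is likewise an assumption.

```python
from typing import List

# Levels 1 and 2 follow the article; level 3 is an assumed label.
PRIORITY_LEVEL = {
    "m/b/p": 1, "w/r/oo": 1, "f/v": 1,
    "s/z": 2, "ch/sh/dge/zh": 2, "th": 2, "n/l": 2,
    "t/d": 3, "k/g": 3, "h": 3,
}


def groups_to_assess(viseme_sequence: List[str], max_level: int = 2) -> List[str]:
    """Keep the viseme groups whose priority level is at or below max_level."""
    kept = []
    for viseme in viseme_sequence:
        level = PRIORITY_LEVEL.get(viseme)
        # Groups not covered here (e.g. other vowels) are skipped.
        if level is not None and level <= max_level:
            kept.append(viseme)
    return kept
```

Passing max_level=1 restricts the result to the most distinguished groups: m/b/p, w/r/oo, and f/v.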