Revolutionizing AI with Cross-Speaker Emotion Transfer

Language is a crucial aspect of human interaction, but the emotion behind the words is equally important. Expressing various emotions like happiness, sadness, anger, and frustration helps us convey messages and connect with others on a deeper level. While generative AI has made significant advancements in many fields, it has struggled to understand and incorporate these nuances of human emotion. This is where Typecast, a startup utilizing AI for synthetic voices and videos, comes in.

Typecast is making waves in the AI landscape with its Cross-Speaker Emotion Transfer technology, which enables users to apply emotions from another person’s voice to their own while still maintaining their unique vocal style. This gives content creators a faster and more efficient way to produce engaging content. Cross-Speaker Emotion Transfer is now available through Typecast’s My Voice Maker feature.

“AI actors have yet to fully capture the emotional range of humans, which is their biggest limiting factor.”

– Taesu Kim, CEO and cofounder of Neosapience and Typecast

According to Taesu Kim, the CEO and cofounder of Neosapience and Typecast, the emotional range of humans remains unconquered territory for AI actors. However, with the new Typecast Cross-Speaker Emotion Transfer, anyone can utilize AI actors that possess a real emotional depth based on just a small sample of their voice.

Although emotions are often categorized into a small set of universal basic emotions, such as happiness, sadness, anger, fear, surprise, and disgust, these categories alone do not capture the full spectrum of emotion in speech generation. Kim highlights that speaking is not a simple one-to-one mapping between text and spoken output. Humans can express the same sentence in countless different ways. Moreover, distinct emotions can be conveyed within the same sentence or even the same word.

For example, recording the sentence “How can you do this to me?” with the emotional prompt “In a sad voice, as if disappointed” will yield an entirely different result than the prompt “Angry, like scolding.” The complexity of emotions goes beyond predefined categories. As Kim and other researchers emphasize, “Humans can speak with different emotions, and this leads to rich and diverse conversations.”

Progress in Emotional Text-to-Speech

Generative AI has made significant advancements recently, led by large language models such as ChatGPT, LaMDA, LLaMA, Bard, and Claude, and text-to-speech technology has seen comparable progress. Emotional text-to-speech, however, requires a substantial amount of labeled data that is not readily available. Kim explains that capturing the subtleties of various emotions through voice recordings has been time-consuming and challenging.

Kim and his colleagues further point out the difficulty of consistently maintaining emotion while recording multiple sentences over an extended period. Traditional emotional speech synthesis relies on training data that is labeled with emotions. These methods often involve additional emotion encoding or reference audio. However, a fundamental challenge arises when there is a requirement for data covering every emotion and every speaker. Moreover, the existing approaches are prone to mislabeling problems and have difficulty extracting the intensity of emotions.

Breaking New Ground with Cross-Speaker Emotion Transfer

Cross-speaker emotion transfer becomes especially challenging when an unseen emotion is assigned to a speaker. The current technology has fallen short in this regard since it is unnatural for emotional speech to be produced by a neutral speaker instead of the original speaker. Additionally, controlling emotion intensity is often not feasible.

To address these challenges, the researchers at Typecast took a new approach. They first fed emotion labels directly into a generative deep neural network, which they describe as a world first. While this method achieved some success, it was not sufficient to express sophisticated emotions and speaking styles. The researchers therefore developed an unsupervised learning algorithm to discern speaking styles and emotions from a vast database.

During training, the entire model was taught without any emotion labels. This allowed the researchers to obtain representative numbers from various speech samples. Although these representations are not easily interpretable by humans, they can be used in text-to-speech algorithms to express emotions stored in the database. Additionally, the researchers created a perception neural network to translate natural language emotion descriptions into these representations.
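That last step can be illustrated with a minimal numpy sketch of how a perception network might map a natural-language emotion description to a latent style vector. The vocabulary, the random linear projection, and the `describe_to_style` function are all hypothetical stand-ins; Typecast has not published its architecture, and a real system would use a trained text encoder rather than a bag of words.

```python
import numpy as np

# Toy vocabulary of emotion-related words; purely illustrative.
VOCAB = ["sad", "angry", "happy", "soft", "scolding", "disappointed"]
LATENT_DIM = 8

rng = np.random.default_rng(0)
W = rng.normal(size=(len(VOCAB), LATENT_DIM))  # stand-in for learned weights

def describe_to_style(description: str) -> np.ndarray:
    """Map a natural-language emotion description to a latent style vector."""
    words = description.lower().replace(",", " ").split()
    bow = np.array([1.0 if w in words else 0.0 for w in VOCAB])
    return bow @ W  # linear projection into the unlabeled style space

sad = describe_to_style("In a sad voice, as if disappointed")
angry = describe_to_style("Angry, like scolding")
```

Different descriptions land at different points in the style space, which is what lets the text-to-speech model render the same sentence with different emotional colorings.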

“With this technology, the user doesn’t need to record hundreds or thousands of different speaking styles/emotions because it learns from a large database of various emotional voices.”

– Taesu Kim, CEO and cofounder of Neosapience and Typecast

Ultimately, the researchers achieved “transferable and controllable emotion speech synthesis” by leveraging latent representations. They employed domain adversarial training and cycle-consistency loss to disentangle the speaker from the style. The technology also learned from an extensive collection of recorded human voices, including audiobooks and videos, to analyze emotional patterns, tones, and inflections.
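The two training objectives mentioned above can be sketched at a high level. Typecast’s exact loss functions are not published, so this numpy sketch only illustrates the ideas: a cycle-consistency term asks that the style recovered from the synthesized speech round-trip back to the requested style, while the adversarial term (here written as a simple confusion loss against a uniform speaker distribution) discourages the style vector from revealing which speaker it came from.

```python
import numpy as np

def cycle_consistency_loss(requested_style, recovered_style):
    # Style re-extracted from the synthesized audio should round-trip
    # back to the style that was requested for the transfer.
    return float(np.mean((requested_style - recovered_style) ** 2))

def adversarial_speaker_loss(speaker_logits_from_style):
    # Domain-adversarial idea: if a classifier cannot tell which speaker
    # a style vector came from, its softmax output is near-uniform;
    # training pushes style embeddings toward this speaker-agnostic state.
    p = np.exp(speaker_logits_from_style - np.max(speaker_logits_from_style))
    p /= p.sum()
    uniform = np.full_like(p, 1.0 / p.size)
    return float(np.mean((p - uniform) ** 2))
```

In practice domain adversarial training is usually implemented with a gradient reversal layer inside an autograd framework; the confusion-loss form above is just one way to convey the same intent without a full training loop.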

The result is a method that can transfer emotions to a neutral reading-style speaker using just a few labeled samples. Emotion intensity can be easily controlled through a scalar value, ensuring natural and seamless emotion transfer without changing one’s unique voice identity. Users can record a basic snippet of their voice and apply a range of emotions and intensity, with the AI adapting to their specific voice characteristics.
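The scalar intensity control described above can be pictured as interpolation in the latent style space. The function below is a hypothetical sketch, not Typecast’s API: it assumes an additive emotion direction and a separately held speaker embedding, so scaling the emotion term leaves voice identity untouched.

```python
import numpy as np

def transfer_emotion(neutral_style, emotion_direction, intensity):
    """Blend a speaker's neutral style with an emotion vector.

    intensity is a scalar in [0, 1]: 0 leaves the delivery unchanged,
    1 applies the full emotion. The speaker embedding (kept separate
    by the disentangling training) preserves voice identity throughout.
    """
    intensity = float(np.clip(intensity, 0.0, 1.0))
    return neutral_style + intensity * emotion_direction

neutral = np.zeros(4)                      # toy neutral reading style
sad = np.array([0.5, -0.2, 0.1, 0.3])      # toy "sad" direction
half_sad = transfer_emotion(neutral, sad, 0.5)
```

Clipping the scalar to [0, 1] mirrors the article’s claim that intensity is a single, easily controlled value rather than a new recording session per emotion.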

Typecast’s technology has already been adopted by renowned companies such as Samsung Securities and LG Electronics in South Korea. With $26.8 million in funding since its establishment in 2017, the startup is also exploring applications of its core speech synthesis technologies in facial expressions. As Kim notes, the media landscape is rapidly evolving, and high-quality expressive voice is indispensable for delivering corporate messages. Typecast’s advancements let ordinary individuals and companies unleash their creative potential and improve productivity.
