Startups such as ElevenLabs have invested millions in developing their own voice cloning algorithms and AI software, which replicate the voices of users. However, a new solution called OpenVoice has emerged from a collaboration between researchers at the Massachusetts Institute of Technology (MIT), Tsinghua University in Beijing, China, and Canadian AI startup MyShell.
OpenVoice sets itself apart from other voice cloning platforms by offering open-source voice cloning that is nearly instantaneous and provides granular controls. From a short audio clip, it clones a voice and lets users adjust emotion, accent, rhythm, pauses, and intonation.
“Clone voices with unparalleled precision, with granular control of tone, from emotion to accent, rhythm, pauses, and intonation, using just a small audio clip.” – MyShell
OpenVoice has gained attention for its features and ease of use. MyShell has published a research paper, not yet peer-reviewed, describing how OpenVoice was developed. The company has also made OpenVoice available on several platforms, including the MyShell web app, which requires a user account, and Hugging Face, which is publicly accessible without an account.
“MyShell wants to benefit the whole research community. OpenVoice is just a start. In the future, we will even provide grants & dataset & computing power to support the open-source research community. The core echo of MyShell is ‘AI for All.’” – Zengyi Qin, MIT and MyShell Researcher
The Advancements in Voice Cloning
In their scientific paper, the creators of OpenVoice, Qin, Zhao, Yu, and Sun, describe how they developed the voice cloning AI. OpenVoice consists of two AI models: a text-to-speech (TTS) model and a tone converter.
The TTS model controls the style parameters and languages. It was trained on 30,000 sentences of audio samples from English and Chinese speakers. The emotion expressed in each sample was labeled, and the model learned intonation, rhythm, and pauses from these clips. The tone converter, by contrast, was trained on over 300,000 audio samples from thousands of different speakers.
“Flexibility here means flexible control over styles/emotions/accent etc, and can adapt to any language. Nobody could do this before, because it is too difficult. I lead a group of experienced AI scientists and spent several months to figure out the solution. We found that there is a very elegant way to decouple the difficult task into some doable subtasks to achieve what seems to be too difficult as a whole. The decoupled pipeline turns out to be very effective but also very simple.” – Zengyi Qin
The combination of these two models allows OpenVoice to reproduce the tone color of a user’s voice while separately controlling the style of the spoken text, such as its emotional expression. According to the researchers, this approach not only produces impressive results but also requires fewer compute resources than other voice cloning methods.
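The two-stage design described above can be pictured with a toy Python sketch. This is not OpenVoice's code or API: every name here (`Utterance`, `base_tts`, `extract_tone_color`, `convert_tone`, `clone_voice`) is hypothetical, and plain strings stand in for real audio and speaker representations. It only illustrates the idea that one stage sets language and style while a second stage swaps in the target speaker's tone color.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    """Placeholder for synthesized speech; a real system would hold audio data."""
    text: str
    language: str
    emotion: str
    tone_color: str  # stands in for a representation of speaker timbre


def base_tts(text: str, language: str = "en", emotion: str = "neutral") -> Utterance:
    """Stage 1 (hypothetical): speak the text in a fixed base voice with the chosen style."""
    return Utterance(text=text, language=language, emotion=emotion,
                     tone_color="base-speaker")


def extract_tone_color(reference_clip_name: str) -> str:
    """Hypothetical stand-in for deriving tone color from a short reference clip."""
    return f"tone-of:{reference_clip_name}"


def convert_tone(utterance: Utterance, tone_color: str) -> Utterance:
    """Stage 2 (hypothetical): replace only the tone color; text and style are untouched."""
    return replace(utterance, tone_color=tone_color)


def clone_voice(text: str, reference_clip_name: str,
                language: str = "en", emotion: str = "neutral") -> Utterance:
    """Chain the two stages: styled base speech first, then tone conversion."""
    spoken = base_tts(text, language, emotion)
    return convert_tone(spoken, extract_tone_color(reference_clip_name))
```

Because the second stage touches only `tone_color`, the emotion and language chosen in the first stage carry through unchanged, which is the sense in which the two subtasks are decoupled.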
MyShell: The Future of AI-Native Apps
Founded in 2023, MyShell is a Canadian startup based in Calgary, Alberta. It has raised a $5.6 million seed round led by INCE Capital, with investments from Folius Ventures, Hashkey Capital, SevenX Ventures, TSVC, and OP Crypto. MyShell already has over 400,000 users and continues to grow.
Besides OpenVoice, MyShell offers a decentralized and comprehensive platform for discovering, creating, and staking AI-native apps. Its web app features various text-based AI characters and bots with distinct personalities, including some NSFW options. The platform also includes an animated GIF maker and user-generated text-based RPGs, some of which are based on popular franchises like Harry Potter and Marvel.
To sustain its operations, MyShell monetizes its services through a monthly subscription for web app users and third-party bot creators. The company also generates revenue by selling AI training data.