OpenAI's Voice Engine can clone your voice from just a 15-second sample

01 April 2024

OpenAI is facing a dilemma. Its Voice Engine model clones voices so well that the company worries it could be exploited, which is why OpenAI is hesitant to release it to the public. So far, the company has shown only a preview of Voice Engine's capabilities, and the results are striking.

The fundamentals of AI-based voice cloning are simple. The model needs only two inputs: an audio sample of the original voice and the text the synthetic voice is meant to read. Feed the tool enough samples, and the output should sound realistic. This is where things become fascinating, and a little dangerous. Unlike other models that are already publicly available, Voice Engine requires only 15 seconds of audio from the original speaker. Despite that limited input, the generated speech is extremely lifelike.

That is exactly why OpenAI is taking its time deciding what to do next, citing its commitment to creating safe and broadly beneficial AI. Malicious actors could use such a powerful tool to spread misinformation.

Voice Engine was initially developed in late 2022. Since then, it has powered the text-to-speech API's preset voices, as well as ChatGPT Voice and Read Aloud. Late last year, OpenAI began quietly testing its voice-cloning capabilities with a select set of trusted partners. The company says it is impressed by the applications this group has built.

One purpose of these tests is to determine how people and various industries could benefit from the technology. The other is to identify how it might be misused and decide what steps to take.

In OpenAI's own words: "At the same time, we are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse. We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities. Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale."

OpenAI advocates for policies and countermeasures to prevent misuse of the technology as it becomes more widely available. For example, original speakers should knowingly and willingly provide their voices to a service, and the service should be able to verify this. Additionally, such services should maintain a "no-go list" of celebrities, politicians, and other prominent figures whose voices may not be re-created.
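
The two safeguards described above, verified consent and a no-go list, could be combined into a single check that runs before any voice is cloned. The following is only an illustrative sketch, not anything OpenAI has published: the names `CloneRequest`, `VoicePolicy`, and `check` are hypothetical, and it assumes the no-go list is matched by speaker name, whereas a real service would likely need to compare the voice itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CloneRequest:
    """A hypothetical request to clone one person's voice."""
    speaker_name: str        # person whose voice would be cloned
    consent_verified: bool   # True only if the service confirmed the speaker's consent


@dataclass
class VoicePolicy:
    """Hypothetical policy gate: no-go list first, then consent."""
    # Assumption: the list holds names; a real system would need voice matching.
    no_go_names: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Normalize names so the lookup is case- and whitespace-insensitive.
        self.no_go_names = {name.strip().lower() for name in self.no_go_names}

    def check(self, request: CloneRequest) -> tuple[bool, str]:
        """Return (allowed, reason) for a cloning request."""
        if request.speaker_name.strip().lower() in self.no_go_names:
            return False, "speaker is on the no-go list"
        if not request.consent_verified:
            return False, "speaker consent has not been verified"
        return True, "ok"
```

In this sketch the no-go list is checked before consent, so a listed voice is refused even if someone claims to have permission.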

The Voice Engine preview should prompt public discussion. The company suggests the following steps to mitigate potential problems:

  • Phasing out voice-based authentication as a security measure for accessing bank accounts and other sensitive information
  • Exploring policies to protect the use of individuals’ voices in AI
  • Educating the public in understanding the capabilities and limitations of AI technologies, including the possibility of deceptive AI content
  • Accelerating the development and adoption of techniques for tracking the origin of audiovisual content, so it’s always clear when you’re interacting with a real person or with an AI

It's worth noting that Voice Engine would not be the only publicly available voice-cloning tool. ElevenLabs is currently the most popular. However, even with many audio samples, its results are not always satisfactory.

Voice Engine appears to be a significant improvement in both ease of use and the quality of the cloned voice.
