Deepfake Virtual Assistant

Deepfakes are synthetically generated videos that use deep learning to swap the likeness of one person with another.

When I was asked to host our team's monthly showcase, I asked myself: is there a better way to do this? (Note: turn on audio.)

I created a "virtual assistant" that was able to host the meeting for me. The agent wasn't technically a deepfake, but rather an example of AI-driven puppetry. It was built using two key elements:

- An AI model that recreates the idiosyncrasies of my voice
- An AI model that takes the generated audio and creates artificial video in which my mouth movements match it
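At a high level, the two components chain together as text → audio → video. A minimal sketch of that flow, with stub functions standing in for the actual models (all names here are illustrative, not from any particular library):

```python
# Illustrative sketch of the two-stage pipeline. The real models are
# replaced with stubs so the data flow is clear; names are hypothetical.

def synthesise_speech(text: str) -> bytes:
    """Stub for the voice-cloning model: text in, audio waveform out."""
    return f"<audio for: {text}>".encode()

def animate_face(source_video: str, audio: bytes) -> str:
    """Stub for the talking-face model: lip-syncs the source video to the audio."""
    return f"{source_video} lip-synced to {len(audio)} bytes of audio"

def virtual_assistant(script: str, source_video: str) -> str:
    audio = synthesise_speech(script)         # stage 1: clone the voice
    return animate_face(source_video, audio)  # stage 2: puppet the video

print(virtual_assistant("Welcome to the monthly showcase!", "host.mp4"))
```

The key design point is that the two models are independent: any text-to-speech system that outputs audio can drive any audio-to-video system.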

Audio Generation

To create this demonstration, I explored a range of methods for cloning my voice.

I first explored zero-shot learning techniques that require only ~5 seconds of audio to create a "profile" of my voice, from which new speech can be generated.

https://github.com/CorentinJ/Real-Time-Voice-Cloning

Real Time Voice Cloning

Unfortunately, the pretrained model was built using North American data samples, so whilst this approach works for American speakers, it gave my Australian accent a bizarre American twang.

I resigned myself to needing to build a large training dataset of my voice. Tacotron 2 allows for text-to-speech models to be created, leveraging a pre-built model and transfer learning to reduce the amount of training data that is needed.

https://github.com/NVIDIA/tacotron2

Using a high-quality condenser microphone, I collected 500 data samples (about 30 minutes of audio). This involved recording short, five-second clips with matching transcripts. The amount of data was key to the quality of the final result: 50 samples led to an unintelligible model, 200 samples led to a "slightly American" model that was recognisable as me, and 500 samples led to a model that impressively captured the unique features of my voice.
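Tacotron 2's training scripts consume a plain-text filelist mapping each audio clip to its transcript, one `path|transcript` pair per line (LJSpeech style). A small helper to build one from recorded clips might look like this (the file paths and transcripts are illustrative):

```python
def write_filelist(transcripts: dict, out_path: str) -> None:
    """Write a Tacotron 2-style filelist: one 'wav_path|transcript' per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for wav_path, text in sorted(transcripts.items()):
            f.write(f"{wav_path}|{text}\n")

# Example: three of the ~500 five-second clips and their transcripts.
samples = {
    "wavs/clip_001.wav": "Welcome everyone to the monthly showcase.",
    "wavs/clip_002.wav": "Today we have three demonstrations lined up.",
    "wavs/clip_003.wav": "Let's start with the first presentation.",
}
write_filelist(samples, "train_filelist.txt")
```

Keeping the transcripts alongside the recordings from the start makes it cheap to retrain as the dataset grows from 50 to 500 samples.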

Initially I trained a model using the open source code; however, with only three days to create this demonstration, I looked for a managed service to speed up the process. Microsoft offered a managed service simplifying this process, as did Resemble.ai, which is the service I chose. These services allow various prosodic features (such as stress, intonation and rhythm) to be defined, helping make the generated voice sound more natural.
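Prosody controls like these are commonly expressed as SSML-style markup (the W3C speech synthesis standard; exactly which tags each provider honours varies). A small illustrative helper that wraps text in prosody tags:

```python
def with_prosody(text: str, rate: str = "medium", pitch: str = "+0st") -> str:
    """Wrap text in SSML-style prosody markup.

    SSML is the W3C standard; the exact tags supported vary by TTS provider,
    so treat this as an illustration rather than any one service's API.
    """
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        "</speak>"
    )

# Slow the delivery and raise the pitch two semitones for emphasis.
print(with_prosody("Welcome to the monthly showcase!", rate="slow", pitch="+2st"))
```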

Video Generation

Typically for deepfake videos, an actor is recorded, and this performance is overlaid onto the original video.

For this demonstration, the intent was instead to "puppet" the original video, animating the mouth to be synchronised with the synthetically generated audio. The code repository I used can be found here:

https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose

The solution is fairly complex: it reconstructs a 3D model of the target face, analyses the input audio, blends the 3D model onto the original source video, and finally animates the result.
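Conceptually, that pipeline breaks into four stages. The sketch below mirrors the structure with placeholder functions; the names and data shapes are hypothetical (the actual repository uses 3D morphable model fitting and neural rendering), so this shows only how the stages hand off to one another:

```python
# Illustrative breakdown of the audio-driven talking-face pipeline.
# Each stage is a stub; names and shapes are hypothetical, not the repo's API.

def reconstruct_3d_face(video_path: str) -> dict:
    """Stage 1: fit a 3D face model (identity + expression) to the source video."""
    return {"source": video_path, "mesh": "3d_face_model"}

def analyse_audio(audio_path: str) -> list:
    """Stage 2: extract per-frame features that drive mouth shape from the audio."""
    return [f"frame_{i}_mouth_params" for i in range(3)]

def blend(face: dict, mouth_frames: list) -> list:
    """Stage 3: render the 3D model with updated mouth parameters per frame."""
    return [{"mesh": face["mesh"], "mouth": m} for m in mouth_frames]

def animate(blended_frames: list, out_path: str) -> str:
    """Stage 4: composite the rendered frames back into the output video."""
    return f"{out_path} ({len(blended_frames)} frames)"

face = reconstruct_3d_face("host.mp4")
frames = analyse_audio("speech.wav")
print(animate(blend(face, frames), "out.mp4"))
```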

Audio Driven Head Pose

Applications

Whilst the chosen use case was fairly light-hearted, these technologies present opportunities to change how people interact.

Imagine a future where these virtual assistants are connected to chatbots, and are able to attend meetings for you and present on your behalf. Or perhaps consider a future where you call customer support, and the person on the other end is the CEO of the company (as a virtual agent!), personally helping resolve your issue.

The ability to create high-quality voice models also opens up opportunities for personalisation. The voices in smart home assistants such as Google Home could be customised to use celebrity voices, or even your own!