An interview with my deepfake digital twin
I was involved in running a generative AI workshop, together with a group of colleagues, to train others in our business on how best to leverage these technologies. What better way to introduce the session than an "interview" between me and my deepfake digital twin? Here is the video of the demonstration:
It's been almost two years since I first sent an AI assistant to attend a meeting for me. That AI agent was pre-rendered, using a few clever tricks to interact with the audience, and the question in my mind since then has been: is it possible to do this in real time, with an AI assistant that can respond to people?
Real-time deepfake video technology has been around for a little while. The key challenge is that, while the person using the deepfake may look like the target person, their voice will still be their own, not the target's.
In this demonstration, I managed to combine both real-time deepfake video and real-time voice transformation. This is no mean feat, and I suspect it is one of the first times this has ever been done, as I have not yet come across examples of it online.
As you can see, compared to the real person, it isn't perfect, but for those who aren't familiar with my appearance and voice, it is more than enough to fool a casual observer.
A note on AI ethics
Whilst this technology is incredible, significant potential harms may result from its misuse. The Australian government's AI Ethics Principles are central to all of the work I do. One of these principles is transparency, and it was important during this demonstration that I immediately disclosed that this was in fact a deepfake. Another important consideration was that deepfakes were only used to replicate the likeness of people who had given explicit, written consent.
Use cases for deepfakes
My current work involves developing products that help protect Australians from fraud and scams, and deepfakes represent a significant threat.
Whilst today significant technical knowledge is required to achieve a feat like this, in the future it will be possible to clone a person's video and audio with only a small amount of sample data. There are already reports of scammers imitating the voices of family members in phone scam calls, so we are just starting to see generative AI being used to create new attack vectors. Ongoing research in this area is needed to counter misuse of AI technologies.
Of course, there are many positive applications of these technologies. In the film industry, deepfakes provide a low-cost alternative to complex CGI. For accessibility, synthetic voices can be used to create unique, personal voice replacements for people who have lost the ability to speak due to illness or injury. For education and marketing, deepfakes allow content to be created at a scale that was previously cost prohibitive. It will be interesting to see how this field evolves in the coming years.
Technical deep dive
The remainder of this post will involve a deep dive into the technologies, challenges and learnings gained throughout this project.
Throughout this post, the person using the deepfake is referred to as the actor and the person whose likeness is being replicated is referred to as the target.
Early exploration and key challenges
There are many different approaches for creating real-time deepfakes, and a range were explored at the start of this project. This section explores some of the (often very humorous!) early attempts, and highlights the key challenges that the selected approach needed to overcome.
Animating photos using facial features captured from a webcam allows for a "one-shot" method: you simply upload a single photo of the target and use computer vision techniques to animate it. Whilst impressive for the low amount of effort, this approach sits firmly in the uncanny valley.
Early experiments led to some very amusing results. A key to a successful deepfake is having the model learn how to blend the target's face into the actor's.
Here you can see the blending of the target's face is working much better. However, the model struggles with a variety of face angles when the actor is not looking directly into the camera.
Face size is another key parameter, normally tuned manually until the face is approximately the correct size. Some models are also able to warp the shape of the face, so that even if the target and actor have differently proportioned faces, the deepfake still works well.
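As a rough sketch of what this tuning involves (the function and parameter names below are illustrative, not DeepFaceLab's actual API), resizing and reshaping the swapped face comes down to rescaling the detected face box around its centre:

```python
def fit_face_box(actor_box, target_ratio, scale=1.0):
    """Rescale the actor's detected face box (x, y, width, height) so its
    width:height ratio matches the target's face proportions.
    Illustrative sketch only, not DeepFaceLab's real interface."""
    x, y, w, h = actor_box
    cx, cy = x + w / 2, y + h / 2   # keep the box centred on the face
    new_w = w * scale               # manual size tuning
    new_h = new_w / target_ratio    # enforce the target's proportions
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)

# e.g. making the swapped face 25% larger while matching a 0.8 width:height ratio
print(fit_face_box((100, 100, 80, 100), target_ratio=0.8, scale=1.25))
```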
Training the video deepfake model
DeepFaceLab (https://github.com/iperov/DeepFaceLab) was identified as the best approach for this project. It has an excellent git repository and an active community with deep experience using the software.
Previous deepfake workflows involved training a model that could only convert a specific actor into a specific target (one-to-one). This means that, for the demonstration, only the actor the model was trained on could use the deepfake model. It also means a large dataset of the actor is required, to help the model learn the features of that person's face.
One-to-one training of a deepfake model
Somewhat analogous to fine-tuning in other domains of machine learning, there is a newer ReadyToMerge workflow, which provides a pretrained model trained on 50,000 faces covering a variety of face shapes and lighting conditions. This results in a deepfake model that is many-to-one. Whilst performance may not match a perfect one-to-one model, in practice the model is far more robust and, as a bonus, can be used by any actor without retraining.
Many-to-one training of a deepfake model
Training the model took almost two weeks of continuous compute, with the main limitation being the time remaining until the workshop date. A better model, trained for the full four weeks, was produced after the demo and gave better results than those shown in the videos.
The final deepfake model output
Training the audio deepfake model
My previous AI assistant used Tacotron 2, a text-to-speech model (it takes in text and generates audio). It is able to create a highly realistic-sounding version of a target's voice.
However, to create a real time deepfake, a speech-to-speech model is needed, which takes in the audio of the actor, and converts it to the target's voice.
The most impressive part of the model is that it does not use a labelled dataset; the only input is audio recordings of the target. For most ML models, you provide training examples and labels. For example, for Tacotron 2, the text-to-speech model used in my previous deepfake project, you must provide audio of the target's voice as well as a transcript. This allows the model to learn the connection between the input data (text) and the output (audio).
However, for Voice.ai, only audio of the speaker is required. One downside is that significantly more data is needed: whereas Tacotron achieves highly realistic voices with 30 minutes of data, multiple hours of audio recordings were required to achieve an acceptable level of quality from Voice.ai.
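To make the distinction concrete, here is a hypothetical sketch of what a single training sample looks like in each case (file names and text invented for illustration):

```python
# Supervised text-to-speech (Tacotron 2): each sample pairs audio with its transcript
tts_sample = {"audio": "target_clip.wav", "transcript": "Hello, and welcome to the workshop."}

# Unsupervised voice conversion (Voice.ai-style): audio only, no labels,
# which is why far more raw recordings are needed
vc_sample = {"audio": "target_clip.wav"}
```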
Here is an example of converting my voice into one of the sample voices. This is not running in real time, hence the higher quality:
Here is a comparison of the real-time voice model, converting the actor's voice into the target's voice:
Putting it all together, in real-time
With both models trained, it's now time to set everything up!
Choosing the right actor
An interesting aspect of deepfakes is that they only replace the face itself. There are some models that do "whole head" replacement, but these are far more complex to train and cannot be run in real time.
The more similar the actor is to the target, the better the end result. There are a number of factors that go into this- the spacing of the actor's eyes, their hair colour and style, whether they have a beard, the size of their face relative to their head, as well as other features.
William, a colleague and friend, volunteered as the actor and to set up his GPU PC with all the required software.
He and I have similar face structures and features; however, there were a few challenges.
Matching the facial hair
A well known (and infamous!) example of digitally altering facial hair is Justice League, where Henry Cavill's moustache was removed. In motion, it looked especially strange and was widely panned by audiences.
Our first attempts at replacing the facial hair did not work well (though interestingly, a few models we trained after the workshop did a reasonable job). Will was unwilling to shave his beard, so I had to urgently grow a beard, generate a new training dataset, and rebuild the model.
Using the same background
Our next challenge was the background. After years of working from home throughout COVID, most of my colleagues are very familiar with my video conferencing background (and wall of instruments).
This was fixed with a virtual background tool, similar to those built into many video conferencing tools such as Zoom and Teams. I provided Will with an image of my office, taken from my webcam.
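Conceptually, a virtual background is just an alpha blend of a person mask over a replacement image. Here is a minimal per-pixel sketch (not the actual tool Will used, which also handles segmenting the person out of the webcam feed):

```python
def blend_pixel(person_px, background_px, alpha):
    """Alpha-blend one RGB pixel of the webcam feed over the replacement
    background. alpha is the foreground (person) mask value in [0, 1]:
    1.0 keeps the person, 0.0 shows the office photo behind them."""
    return tuple(
        round(alpha * p + (1 - alpha) * b)
        for p, b in zip(person_px, background_px)
    )
```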
Another challenge to overcome was that Will has a lighter hair colour than mine, with a slightly red streak.
Matching the hair styles
Fortunately, Will was happy to use a temporary black hair spray to dye his hair on the morning of the workshop, which was key to making it look believable. Often a wig is used, when deepfakes are used on film sets!
Matching the recording device
The original performance of the audio deepfake was quite poor; we realised that matching the recording device was very important. The model was trained on data recorded with an expensive condenser microphone. When performing inference (using the model), Will initially used a Bluetooth headset, which typically has a fairly poor-quality microphone. When I provided Will with the condenser microphone, the model's performance improved significantly, as the input audio became more similar to what the model was trained on.
Matching the latency
The video deepfake was able to run in real time (low latency) at a high frame rate (~30 FPS). There was almost no perceptible delay between the actor's movement and the deepfake output.
This is largely because the video stream does not have a temporal component: each frame generated by the deepfake is mostly independent of the others. What happened in the last frame does not influence the face replacement in the next frame (though some processing may occur across frames to smooth the final result). This results in a low input latency.
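One common form that cross-frame smoothing takes is an exponential moving average over the detected face landmark positions, which removes jitter using only past frames, so latency stays near zero. A sketch of the idea (DeepFaceLab's internal smoothing may differ):

```python
def smooth_landmarks(frames, alpha=0.5):
    """Exponentially smooth per-frame face landmark positions.
    Each frame is a list of (x, y) points; only past frames are
    consulted, so no extra delay is introduced."""
    smoothed, prev = [], None
    for points in frames:
        if prev is None:
            prev = points  # first frame passes through unchanged
        else:
            prev = [
                (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
                for (x, y), (px, py) in zip(points, prev)
            ]
        smoothed.append(prev)
    return smoothed
```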
The audio stream is an entirely different story. The audio model has an extremely important temporal component: the phonemes (the units of sound that make up words) that came before, and those that come after, are key to morphing one speaker's voice into another's. The AI needs to know which phonemes follow the sound it is currently generating, to ensure the words flow smoothly into one another.
Voice.ai includes a parameter for this when doing real-time voice morphing. The higher you set the quality, the longer the delay between speaking and hearing the deepfake audio. That is, the longer the AI has to listen to what you say next, the better it can mimic the target's voice.
Setting the delay to be almost instantaneous led to very poor quality voices. At the other extreme, the voice sounded very good, but lagged by almost 5 seconds.
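The mechanics of that trade-off can be sketched as a simple lookahead buffer: the converter holds each audio chunk until a fixed number of future chunks have arrived, so the model can "hear" the phonemes that follow. (Illustrative only; Voice.ai's quality parameter behaves like this in spirit, not necessarily in implementation.)

```python
from collections import deque

def stream_with_lookahead(chunks, lookahead):
    """Emit each audio chunk only once `lookahead` later chunks have
    arrived. A larger lookahead gives the voice model more future
    context (better quality) at the cost of proportionally more delay."""
    buffer, emitted = deque(), []
    for chunk in chunks:
        buffer.append(chunk)
        if len(buffer) > lookahead:
            emitted.append(buffer.popleft())  # released `lookahead` chunks late
    return emitted  # chunks still buffered would flush when the stream ends
```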
In order to have a conversation between myself and the digital twin, we chose to set the input latency at 1.5 seconds. This meant that every time Will answered a question, there was an additional 1.5 seconds before he would be heard; 1.5 seconds was about the limit before the pause in responding would start to become noticeable.
Furthermore, Will needed to introduce an artificial delay to the video stream, so that the audio and video would be synchronised (otherwise the lips would be out of time with the audio).
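Keeping the lips in sync is then just a matter of holding the video back by the same amount: at roughly 30 FPS, a 1.5-second audio delay means buffering about 45 frames. A minimal sketch of that delay line (not the actual tool used on the day):

```python
from collections import deque

def make_video_delay(latency_s=1.5, fps=30):
    """Return a function that delays video frames by `latency_s` seconds,
    matching the audio model's latency so lips stay in time with speech."""
    hold = round(latency_s * fps)   # frames to hold back (45 at the demo settings)
    frames = deque()
    def delay(frame):
        frames.append(frame)
        # Nothing is emitted until the buffer fills; after that, frames
        # come out exactly `hold` frames (latency_s seconds) late.
        return frames.popleft() if len(frames) > hold else None
    return delay
```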
The final setup
Will joining the Teams workshop, as a digital twin of myself.
The extended cut
If you are interested, the full interview is included below. There was even a mistake at the start (removed in the shortened version), where the actor forgot to unmute his microphone before talking. It actually helped add to the realism and sell the effect; perhaps an idea for future use cases!