Microsoft has unveiled a new AI Pandora’s box: VASA-1 is an artificial intelligence technology that can create hyper-realistic deepfake videos from a single picture in real time.
The launch feels like déjà vu. In February 2023, Microsoft rushed to be the first to integrate a free, public large language model (LLM) into its web search engine with the Bing AI chatbot, generating more than one scandal and controversy. The new VASA-1, while impressive, promises to do the same.
How does VASA-1 work? What drives this technology at its core? What security and privacy issues does this new technology bring to the table? Is VASA-1 the end of Face ID and the start of the new era of deepfakes?
In this report, Techopedia dives deep into VASA-1 to answer these and other questions.
Key Takeaways
- Microsoft VASA-1 can generate hyper-realistic deepfake videos with just a single image and audio file.
- VASA-1 has raised privacy and security concerns: it could potentially bypass facial recognition systems and challenge the security of Face ID.
- The biometric industry is expected to adapt and develop new measures to address deepfake threats. But can they keep up with innovation?
- Advancements in deepfakes will likely change the way security training is conducted.
- VASA-1 has the potential for various applications in education, communication, and entertainment. However, the widespread use of deepfakes by cybercriminal groups worries the community.
VASA-1: Is the World Ready for Advanced Visual Affective Skills?
Before we analyze the security challenges of VASA-1 and whether the tech spells the end of Face ID and biometric security, and what it means for deepfake worries, it’s essential to understand the concept, technology, and innovations behind the new Microsoft breakthrough.
This is wild.
Microsoft just unveiled their hyper-realistic talking head AI:
VASA is a framework for generating lifelike talking faces of virtual characters with visual affective skills (VAS).
All from a single static image and audio clip.
Their first model, VASA-1, can:
— Alex Banks (@thealexbanks) April 18, 2024
There are a dozen incredible examples on the Microsoft site, and more technical readers may wish to read Microsoft Research’s new paper introducing VASA-1 (PDF), released on April 16 on arXiv.
The paper describes VASA-1 as an AI framework for generating lifelike talking faces with “appealing visual affective skills (VAS)”.
VAS and Visual Stimuli
VAS refers to the ability to perceive and understand emotions through visual stimuli. This includes interpreting facial expressions, body language, and other elements in images and videos. Humans use VAS on a daily basis to communicate, socialize, understand, and “read” people.
But in the field of AI, VAS refers to a new concept where an AI system can generate visuals that evoke emotions in viewers. In the case of VASA-1, this involves creating hyper-realistic facial expressions on virtual characters.
The Wow Factor of VASA-1
VASA-1 can create deepfake talking heads that apply the concept of VAS despite being given very little information — it only needs one headshot photograph and a snippet of speech via an audio file.
Additionally, the tech does not require highly advanced computing power. It can produce lip movements that match the audio and image so well that even experts have a hard time discerning whether the video is real or not.
The Microsoft Research Asia paper claims that the tech outperforms all previous deepfake methods.
“Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively.”
VASA-1 generates video frames of 512×512 size at 45 fps in the offline batch processing mode and can support up to 40 fps in the online streaming mode with a preceding latency of only 170 ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.
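To put those figures in perspective, the short Python sketch below converts the reported frame rates into a per-frame time budget and expresses the 170 ms starting latency as a number of frames. The function names are illustrative, and the calculation assumes frames are produced at a steady rate, which the source does not state.

```python
def frame_interval_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def latency_in_frames(latency_ms: float, fps: float) -> float:
    """Express a latency as the number of frame intervals it spans."""
    return latency_ms / frame_interval_ms(fps)


offline_budget = frame_interval_ms(45)        # about 22.2 ms per frame
online_budget = frame_interval_ms(40)         # 25 ms per frame
startup_frames = latency_in_frames(170, 40)   # under 7 frames of delay

print(f"offline: {offline_budget:.1f} ms/frame")
print(f"online:  {online_budget:.1f} ms/frame")
print(f"startup delay: {startup_frames:.1f} frames")
```

In other words, under this assumption the streaming mode has roughly 25 ms to produce each frame, and the initial delay is shorter than a fifth of a second.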
Breaking Down the Core Innovations of VASA-1
Microsoft claims VASA-1 is paving the way for the future of online digital avatars that can perfectly mimic human conversational behaviors. Researchers explained that this new tech is possible thanks to a handful of core innovations, which they summarized in one sentence that requires a careful breakdown.
“The core innovations include a diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos.”
Diffusion-Based Model
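When Microsoft refers to a diffusion-based model, it means a technique that generates output by progressively removing noise. The system starts from random noise and refines it step by step until a clear result emerges. In VASA-1, according to the researchers’ own description, this process generates facial dynamics and head movement inside the face latent space rather than working on the final picture directly.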
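The noise-removal idea behind diffusion can be illustrated with a toy Python sketch. This is not Microsoft’s implementation: the noise schedule, the four-number “signal”, and the function names are all assumptions chosen for illustration. A real diffusion model trains a neural network to predict the noise; here the true noise is passed in as a stand-in for that prediction, so the recovery is exact.

```python
import math
import random


def make_alpha_bar(steps=50, beta_start=1e-4, beta_end=0.2):
    """Cumulative signal-retention factors for a linear noise schedule."""
    alpha_bar, running = [], 1.0
    for i in range(steps):
        beta = beta_start + (beta_end - beta_start) * i / (steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, a_bar, eps):
    """Forward process: blend the clean signal with Gaussian noise."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]


def remove_noise(x_t, a_bar, eps_pred):
    """Reverse estimate: subtract the predicted noise and rescale."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(x_t, eps_pred)]


rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]                 # hypothetical stand-in for a small latent vector
eps = [rng.gauss(0.0, 1.0) for _ in clean]    # the noise that gets mixed in
a_bar = make_alpha_bar()[-1]                  # heaviest noise level in the schedule
noisy = add_noise(clean, a_bar, eps)
restored = remove_noise(noisy, a_bar, eps)    # oracle: true noise replaces a trained network
```

At the heaviest noise level almost none of the original signal survives in `noisy`, yet knowing (or accurately predicting) the noise is enough to recover it. Learning that prediction is what training a diffusion model amounts to.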
Holistic Facial Dynamics and Head Movement Generation
With holistic facial dynamics and head movement generation, instead of focusing on individual parts of the face (like lips or eyebrows), VASA-1 considers all facial movements and head turns as a whole, creating a more natural and cohesive look.
Latent Space
As researchers explain, VASA-1 works in a ‘face latent space’. A face latent space is a compact, coded representation that captures the “key data” of a face, including information about facial features, expressions, and head position.
Finally, an ‘expressive latent space’ allows the model to generate a wide range of emotions and facial movements. Disentangled means that different aspects of the face (like lip movement vs. eye gaze) are represented separately in the code, making it easier to control them independently.
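To make “disentangled” concrete, here is a minimal Python sketch of a latent code whose factors are stored separately. The field names and values are hypothetical and are not taken from the VASA-1 paper; the point is only that one factor can change while the others stay untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FaceLatent:
    """Hypothetical disentangled face code: each factor lives in its own field."""
    identity: tuple     # who the face belongs to
    head_pose: tuple    # head orientation
    lip_motion: tuple   # mouth shape driven by speech
    eye_gaze: tuple     # where the eyes are looking


base = FaceLatent(
    identity=(0.2, 0.7),
    head_pose=(0.0, 0.0, 0.0),
    lip_motion=(0.1,),
    eye_gaze=(0.0, 0.0),
)

# Change only the gaze; every other factor is carried over unchanged.
looking_left = replace(base, eye_gaze=(-0.5, 0.0))
```

In an entangled representation, by contrast, nudging the numbers that control gaze could also alter the lips or even the identity of the face.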
Microsoft Research released a framework for generating lifelike talking faces of virtual characters. The premiere model, VASA-1, can produce lip movements that are exquisitely synchronized with the audio & capture a large spectrum of facial nuances & natural head motions that… pic.twitter.com/eLcnreLSCL
— Antonio Vieira Santos (@AkwyZ) April 18, 2024
The End of Face ID and the Era of Deep Fakes?
As the passwordless future continues to advance, biometrics, and especially Face ID, have become one of the most advanced security technologies in existence. They have grown stronger in the past few years thanks to advancements in machine learning (ML) that power user verification and authentication, along with the integration of high-quality cameras and sensors available in every modern smartphone.
As users around the world increasingly gain access to biometric technologies and other forms of modern authentication, Microsoft, Apple, Google and other top tech industry giants promise to phase out passwords.
However, VASA-1 style tech may well be able to bypass the most important biometric Face ID security guardrail: liveness checks, where logging in may require a quick video capture of your face ‘in action’.
Does This Mean the End of Face ID?
The Microsoft Research blog post reassures the public, saying that none of the faces used in the presentation of the technology belong to real people; they are identities generated by StyleGAN2 or DALL-E 3 (except for the example made from the Mona Lisa).
“We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world.”
Microsoft goes on to explain that VASA-1 is currently only a research demonstration. The company adds that there is no product, application programming interface (API), or release plan for it.
While VASA-1 is not currently available to the public, Techopedia expects this new release to create a competitive environment in which different technology companies attempt to reach the same advanced deepfake generation milestone.
Additionally, the VASA-1 research paper, along with the fake videos presented on the Microsoft site, can be a good starting point for anyone interested in reverse-engineering this technology, including well-funded, well-resourced cybercriminal groups and international criminal syndicates.
How the Biometric Industry Responds to New Threats
The biometric community has grown exponentially in recent years. Precedence Research estimates that the sector will grow from $41.58 billion in 2023 to $267 billion by 2033.
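As a quick check on what that forecast implies, the Python sketch below derives the compound annual growth rate (CAGR) from the two figures. It assumes ten compounding years between 2023 and 2033; the function name is illustrative.

```python
def compound_annual_growth_rate(start_value: float, end_value: float, years: int) -> float:
    """Constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the Precedence Research estimate, in billions of US dollars.
cagr = compound_annual_growth_rate(41.58, 267.0, years=10)
print(f"Implied CAGR: {cagr:.1%}")  # roughly 20% per year
```

A market compounding at roughly a fifth per year grows more than sixfold over the decade, which matches the jump from $41.58 billion to $267 billion.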
As biometrics advances and becomes mainstream, biometric security companies develop solutions to deal with new threats.
For example, in the early years of biometrics, “masking attacks”, in which an attacker presents a fake copy of a real user’s physical characteristics (such as a face mask), became a real problem. However, the biometric industry soon released liveness checks to combat these fraudulent attempts to breach their systems.
The same happened recently with biometric injection attacks. As the sector realized that criminals were advancing in their techniques and feeding fake data, like deepfake videos or pre-recorded voice samples, to gain unauthorized access, the industry responded.
Companies like Innovatrics — a leading global biometric provider — released video injection attack detection technologies to combat deepfakes and synthetic identity fraud.
Daniel Ferak, Innovatrics Business Unit Director, told the press during the tech’s presentation: “Recognizing the rising use of video injection spoof attacks by fraudsters, Innovatrics’ advanced algorithms can now secure the camera used during identity verification, preventing video injection spoofs and man-in-the-middle attacks.”
Innovatrics is not the only biometric company offering this technology. Biometric injection attack defenses are now considered the norm and are used worldwide by companies, governments, border authorities, airports, and land crossings.
But is the tech capable of dealing with the next-gen of deepfakes? Is AI moving faster than biometric security innovations?
According to Gartner, biometric defenses are not keeping up. Gartner says that by 2026, 30% of enterprises will no longer consider face biometric identity verification and authentication solutions to be reliable in isolation, due to the rise in attacks that use AI-generated deepfakes against face biometrics.
Akif Khan, VP Analyst at Gartner, discussed the evolution of AI in the report and how it can erode trust in biometric security.
“In the past decade, several inflection points in fields of AI have occurred that allow for the creation of synthetic images. These artificially generated images of real people’s faces, known as deepfakes, can be used by malicious actors to undermine biometric authentication or render it inefficient.
“As a result, organizations may begin to question the reliability of identity verification and authentication solutions, as they will not be able to tell whether the face of the person being verified is a live person or a deepfake.”
Proof of Life: Deepfakes Changing Security Training
Anna Redmond, Founder of Braav — a venture-backed startup making it easy and seamless for companies to find fractional Chief Security Officers for their needs — spoke about how advancements like VASA-1 disrupt security.
“The ability to create realistic avatars from a photo will dramatically change security training.
“Right now, security professionals operate based on requests from vetted, trusted individuals,” Redmond said.
“Does this mean that the new standard of security is – if you didn’t talk to me in person, I can’t accept it? Or will we need to devise new secret phrases and passcodes so that we know the person we’re talking to is a real human and not an avatar?”
In security, there’s a concept called “proof of life,” which is often a question whose answer only a few people know.
“It’s [proof of life] usually used in scary, critical situations like kidnappings when you need to ascertain if the person you care about is actually still alive. I wonder if we’ll start using these questions not for proof of life but for proof of authenticity.”
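One way to picture a “proof of authenticity” built on a shared passcode is a challenge-response check. The Python sketch below is an illustrative assumption rather than something Redmond or Braav proposed: both parties hold a secret agreed in advance, the verifier sends a fresh random challenge, and only someone holding the secret can compute the matching response. The function names and the example passphrase are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> str:
    """Fresh random challenge, so an old response cannot be replayed."""
    return secrets.token_hex(16)


def respond(shared_secret: bytes, challenge: str) -> str:
    """Compute the response only a holder of the shared secret can produce."""
    return hmac.new(shared_secret, challenge.encode(), hashlib.sha256).hexdigest()


def verify(shared_secret: bytes, challenge: str, response: str) -> bool:
    """Check the response using a constant-time comparison."""
    expected = respond(shared_secret, challenge)
    return hmac.compare_digest(expected, response)


secret = b"passphrase agreed in person"   # hypothetical pre-shared secret
challenge = make_challenge()
genuine = respond(secret, challenge)
impostor = respond(b"a guess", challenge)
```

Because the secret itself is never spoken or sent, a convincing avatar that only has someone’s face and voice would still fail the check.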
Does VASA-1 Have Business Potential?
Microsoft claims that VASA-1 has the potential to enhance numerous areas and sectors, from educational equity to accessibility for individuals with communication challenges to therapeutic support for those in need.
However, businesses and organizations around the world consider other use cases when they see what VASA-1 can do. Next-gen AI deepfake chatbots are expected to drive marketing, customer support, sales, and other areas key to revenue generation.
Kevin Surace, the “Father” of the Virtual Assistant & Voice User Interface and Chair at security company Token, shared his insight with Techopedia for this report.
“Microsoft’s entry here is excellent and state-of-the-art across all models I have seen. The implications for personalizing emails and other business mass communication are fabulous, and it even allows for animating older pictures.
“To some extent, this is just fun, and it has solid business applications we will all use in the coming months and years.”
Surace said that the technology can also be used to replace a live webcam with a virtual version of yourself, “especially when you have a bad hair day”.
“But of course, the images we see today are already a digitally reproduced image of you — meaning the webcam is gathering pixels, processing them, compressing them, sending them across the country, and recomposing it on someone’s screen,” Surace added.
“This is arguably the next extension of that, manipulating the pixels in real-time so that you can truly look your best. And it’s still your voice and your words.”
Surace added that media and entertainment also stand to benefit as low-cost content creation at scale gets better, and that the tech also democratizes access. The same applies to marketing and mass communications.
“That’s great for creators… even if overwhelming for the viewers.
“This continues to take us closer to perfect video and audio representations of ourselves with and without our permission.
“Of course, the major models will include a watermark stating this is AI-generated.
“But in time, open-source models will emerge that don’t…”
The Bottom Line
Next-gen deepfakes are inevitable. AI tech is only getting better, and while businesses and sectors like education, accessibility, and healthcare stand to gain, it will be increasingly difficult to tell a real person from a fake one online.
There have been numerous cases in which cybercriminals have used deepfakes either to hijack YouTube channels and post fake content to steal money from users or to convince workers to wire millions of dollars to an unknown account.
Furthermore, the training and use of these models also pose serious privacy, compliance, and ethical challenges. As technology continues to innovate, big tech will continue to release impressive but dangerous advancements, leaving users and experts wondering about the risks.