In December 2024, Google launched Gemini 2.0, its most advanced multimodal AI model to date. The update quickly caught attention for its native image generation and its audio input and output.
Combine the two, and you get image editing controlled by voice command. That may sound like just another item on a feature list, but when you try it out, it offers a real “Eureka!” moment.
You don’t need to spend hours creating images from scratch or hopping through menus and toolbars. You can just tell Gemini what you want it to create in natural language and then refine the image through conversation.
The widely available Gemini 2.0 model is the search giant’s attempt to bring the chatbot closer to being a universal artificial intelligence (AI) agent, as we pointed out in our AI predictions for 2025. In doing so, it begins to step into the territory of Photoshop and other image editing apps.
Techopedia examines Gemini’s new abilities and skills below.
Key Takeaways
- In December 2024, Google launched Gemini 2.0, the tech giant’s most powerful model to date.
- Gemini 2.0’s ability to generate and edit images via voice instructions could eventually present serious competition for tools like Photoshop.
- It is part of Google’s plan to turn Gemini into a “Universal AI Agent” capable of carrying out many actions autonomously.
- Photoshop remains a highly specialized tool, but casual users may find text-to-image and voice-to-image with Gemini to be more accessible.
- Adobe also added text-to-image capabilities to Photoshop in 2024.
Google’s AI Agent: A Look at Gemini 2.0
Google is attempting to turn Gemini into a multimodal Swiss army knife that you can use for everything from searching the web to answering questions, creating written content, and generating images on demand.
As Google’s documentation notes, the new Multimodal Live API lets developers build real-time vision and audio streaming applications with tool use, and the model’s overall understanding has improved.
For instance, Gemini 2.0 is better at following complex instructions, offers more help with coding prompts, and can call functions defined outside the model.
This helps turn Gemini 2.0 into an agentic AI model, one able to take actions rather than simply deliver text or image outputs.
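To make the idea of function calling concrete, here is a minimal Python sketch of the general pattern: the model returns a structured request naming a function and its arguments, and the application looks that function up and runs it. It does not use Google's API. The names `get_weather`, `TOOLS`, and `dispatch`, as well as the shape of the simulated request, are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical tool the application exposes to the model.
def get_weather(city: str) -> dict:
    # A real tool would query a weather service; this returns canned data.
    return {"city": city, "forecast": "sunny"}

# Registry mapping tool names to Python callables (illustrative only).
TOOLS = {"get_weather": get_weather}

def dispatch(model_request: str) -> dict:
    """Run the function a model asked for and return its result."""
    request = json.loads(model_request)
    name = request["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**request["args"])

# Simulated model output: a structured request rather than plain text.
simulated = '{"name": "get_weather", "args": {"city": "Paris"}}'
result = dispatch(simulated)
print(result)
```

The point of the pattern is that the model only proposes the action; the application decides which functions exist and actually executes them.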
It is the ability to create and edit images with voice commands alone, however, that impressed many in the industry.
With the Imagen 3 image generation model under the hood, Gemini can generate an image on request and then edit it in response to follow-up verbal instructions. This includes removing objects from images and placing objects from previous images into another one.
One fan is Alon Yamin, CEO and co-founder of Copyleaks, who told Techopedia:
“It represents a fascinating leap forward in AI-assisted creativity. This technology opens up opportunities for democratizing visual content creation, potentially allowing anyone with an idea to bring it to life with just a few spoken words.
“The ability to generate and manipulate images through voice commands could revolutionize workflows in various industries, from marketing and advertising to education and entertainment.”
How it Works: Testing the Gemini 2.0 Flash Experimental Model
The Gemini 2.0 Flash Experimental model’s base capabilities are impressive.
To generate an image, all you need to do is enter a written or voice command detailing the type of image that you want to create.
To test this out, we instructed Gemini to generate an image of a dinosaur on the beach (the output of this image can be seen above).
Then we asked Gemini to edit the image, instructing it to “change the color of the dinosaur in the image to red.” This generated the following result:
This did change the design of the image somewhat, but we were satisfied that the depiction of the dinosaur remained consistent across the two images.
But what about adding an object to the image?
To push Gemini’s editing functions a bit further, we asked the model to generate the same image with a beach ball shown on the beach. The results were as follows:
As you can see, Gemini added the beach ball correctly, even if it did displace the tree in the background, which appears to be hovering in the air.
There were a few hits and misses, but this is a far cry from where we were just a few years ago, dragging a cursor around to design visual assets from scratch.
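The test above is essentially a three-turn conversation: one request for an image, followed by two edits that each build on the previous result. As a rough sketch of that flow, and not of how Gemini works internally, the turns can be modeled as an initial prompt plus an ordered list of refinements. `EditSession`, `refine`, and `transcript` are hypothetical names, and no model is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class EditSession:
    """Tracks an image request and its follow-up edits (no model call)."""
    base_prompt: str
    edits: list[str] = field(default_factory=list)

    def refine(self, instruction: str) -> None:
        # Each follow-up is a new turn that builds on everything before it.
        self.edits.append(instruction)

    def transcript(self) -> list[str]:
        # The full context a model would need to keep the image consistent.
        return [self.base_prompt, *self.edits]

session = EditSession("an image of a dinosaur on the beach")
session.refine("change the color of the dinosaur in the image to red")
session.refine("generate the same image with a beach ball shown on the beach")

for turn, text in enumerate(session.transcript(), start=1):
    print(f"turn {turn}: {text}")
```

Seen this way, the floating tree is easier to explain: each turn asks for a fresh image that respects the whole history, so details nobody mentioned can shift between turns.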
Based on where Gemini 2.0 is now, it has the ingredients to present a legitimate challenge to Photoshop in image creation.
Many users would likely opt to generate images with Gemini’s natural language capabilities rather than design them from scratch in Photoshop or a similar tool; for them, it is far preferable to hunching over and drawing out every design by hand.
You can see more examples of Gemini 2.0’s output here:
RIP Photoshop.
Gemini 2.0 now can edit image by talking to it , it's insane.
Here’s how it works: pic.twitter.com/qQ5rOxzBiq
— el.cine (@EHuanglu) December 13, 2024
As AI continues to mature, interacting with software via natural language and voice assistants is becoming a viable alternative to operating traditional digital tools manually.
Even Adobe has resorted to adding text-to-image capabilities in Photoshop, signaling a shift in user expectations.
However, Google is aiming to take things a step further.
“With new advances in multimodality – like native image and audio output – and native tools to use, it will enable us to build new AI agents that bring us closer to our vision of a universal assistant,” the announcement blog post said.
Wider Thoughts: Accessible Image Editing
While we were impressed with Gemini 2.0’s ability to generate images, and to understand objects depicted in the image, it is a bit rough around the edges.
For instance, during our testing, we noticed that voice commands were often misrecognized and had to be repeated several times.
That being said, with the help of Imagen 3, Gemini 2.0 does a great job of creating aesthetically pleasing images in a much more accessible way than a more specialist tool like Photoshop.
While Photoshop gives users far more control over editing, it comes with a steep learning curve.
If you’re a graphic designer, learning how to use these tools can give you more control over your designs, but for most of us, Gemini is a far more accessible, free alternative for creating images.
The Bottom Line
It would be wrong to say that Gemini 2.0, in its current form, blows Photoshop out of the water in terms of overall capabilities.
However, what it does offer is a free image creation alternative that anyone can use without any specialist training or learning curve.
If virtual agents like Gemini continue to advance at this rate, their ability to interpret user instructions could come to rival traditional editing tools, and AI’s apparent tendency to absorb every other tool would continue.