Such a cool week! AI tools are finally getting good enough to fit naturally into the developer workflow. Coding assistants are turning into collaborators, saving us time on repetitive coding tasks and delivering real productivity gains. In line with Microsoft’s focus on the agentic web and the GitHub Copilot updates presented at Microsoft Build, today we decided to explore two other important new coding agents that deserve your attention: Google DeepMind’s self-evolving AlphaEvolve and OpenAI’s Codex, the first cloud-hosted Autonomous Software Engineer (A-SWE). (Coding agents and models are certainly trending this week, with Google’s asynchronous coding agent Jules, Mistral’s Devstral coding model, and others also announced.)
Google DeepMind introduced AlphaEvolve just ahead of Google I/O. It’s an evolutionary coding agent designed to autonomously discover novel algorithms and scientific solutions. It’s the breakthrough we really needed to reshape how AI can optimize engineering algorithms (even hardware design) and solve complex problems. Meanwhile, Codex from OpenAI is a powerful AI coding assistant that writes, tests, and fixes code. It works safely with your repository and acts as a virtual coworker within ChatGPT. Let’s explore what is so special about these releases, how they work, and how they are changing coding and AI in general!
In today’s episode, we will cover:
Evolution is the key: Google DeepMind’s AlphaEvolve
How does AlphaEvolve work?
Achievements of AlphaEvolve
Not without limitations
OpenAI’s Codex coding assistant
How does Codex work?
How good is Codex?
Codex CLI: complementing Codex
Current limitations
Conclusion
Sources and further reading
Evolution is the key: Google DeepMind’s AlphaEvolve
Let’s start with Google DeepMind, whose “Alpha” systems have been reshaping far more than just the AI field for a long time. Do you remember the legendary AlphaGo, the first AI system to defeat a world champion in the complex board game Go? Then came AlphaZero, a general-purpose game-playing AI that mastered chess, shogi, and Go without human data. It learned on its own through self-play, and it was the moment when Monte Carlo Tree Search (MCTS) shone as a powerful algorithm for deciding which move to play next.
One of the most outstanding breakthroughs is AlphaFold, which solved one of biology’s biggest mysteries: how to predict a protein’s 3D shape just from its amino acid sequence. Thanks to a combination of machine learning, evolutionary biology, and structural modeling, it has revolutionized our ability to understand proteins and saved years of lab work. It predicted the structures of over 200 million proteins and sped up drug discovery and medical research. AlphaFold also earned its creators, Demis Hassabis and John Jumper, a share of the 2024 Nobel Prize in Chemistry.
Google DeepMind’s problem-solving tools are also notable for their achievements. AlphaGeometry excels at solving complex geometry problems, performing comparably to top human competitors in mathematical olympiads. AlphaProof, which combines language models with reinforcement learning to translate natural language problems into formal proofs, achieved results on par with silver medalists in the International Mathematical Olympiad.
As we can see, Google DeepMind’s developments go beyond AI itself, serving practical fields that are crucial for our world and building powerful algorithms that can learn on their own. And now it’s time for the new AlphaEvolve.
At its core, AlphaEvolve is an advanced AI tool that helps LLMs become even better at solving very complex problems, like improving computer systems or tackling scientific challenges. It’s a coding agent inspired by the concept of evolution, and it works similarly to natural selection. It tries out changes to the code, gets feedback, and keeps improving over time. This tool automates part of the scientific discovery process, which involves brainstorming ideas, trying things out, learning from failures, and refining results. This process can take years, but AlphaEvolve speeds it up by letting AI do a lot of that work.
Let’s break down the technologies behind AlphaEvolve.
How does AlphaEvolve work?
In AlphaEvolve, Google DeepMind uses an ensemble of top Gemini models to enhance creativity and automated evaluators to verify solutions. Here is how AlphaEvolve works step-by-step:
A user starts by giving AlphaEvolve initial code and an evaluation function, the scoring system that identifies how good a solution is. It is usually just a Python function that runs the solution and returns scores. In essence, the user defines "what success looks like" through the evaluate function. AlphaEvolve then uses LLMs to propose "how to get there" by modifying the code, and the evaluate function provides the crucial, objective feedback that guides this search for better solutions.
After that, AlphaEvolve enters a loop that keeps running until it finds really good solutions.

Image Credit: AlphaEvolve original paper
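To make the two user-supplied pieces concrete, here is a minimal sketch of what a starting program and an evaluate function could look like. Everything in it is an assumption made for illustration: the toy task (approximating a square root), the names `INITIAL_PROGRAM`, `approx_sqrt`, and `evaluate`, the score names, and the use of `exec` to run the candidate. The comment markers follow the `# EVOLVE-BLOCK-START` convention with a matching end marker; the rest is not taken from AlphaEvolve itself.

```python
import math

# Hypothetical starting program. Only the region between the two markers is
# meant to be rewritten by the agent; everything else is fixed scaffolding.
INITIAL_PROGRAM = '''
def approx_sqrt(x):
    # EVOLVE-BLOCK-START
    return x / 2.0
    # EVOLVE-BLOCK-END
'''


def evaluate(program_source: str) -> dict:
    """Run a candidate program and return named scores (higher is better)."""
    namespace = {}
    # Illustration only: a real system would run untrusted code in a sandbox.
    exec(program_source, namespace)
    candidate = namespace["approx_sqrt"]

    test_inputs = [1.0, 4.0, 9.0, 16.0, 25.0]
    errors = [abs(candidate(x) - math.sqrt(x)) for x in test_inputs]

    # Errors are negated so that "bigger score" always means "better program".
    return {
        "neg_mean_error": -sum(errors) / len(errors),
        "neg_max_error": -max(errors),
    }


scores = evaluate(INITIAL_PROGRAM)
print(scores)
```

The point of the design is that the scorer never needs to know how a candidate was produced. Any edit that keeps the function signature intact can be scored the same way, which is what lets an automated search run unattended.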
Special markers (like # EVOLVE-BLOCK-START) tell AlphaEvolve which parts of the code it’s allowed to modify and improve. Everything else stays as the “skeleton” that ties it together.
AlphaEvolve looks at previously tried programs stored in its Program Database and puts together a prompt (user’s instructions + examples) to help LLMs understand how to improve the code.
Generating ideas with LLMs: This is the most interesting part, because AlphaEvolve uses an ensemble of two Gemini models (the paper names Gemini 2.0 Flash and Gemini 2.0 Pro) to suggest code edits and improvements:
Gemini Flash is the faster explorer. It maximizes the breadth of ideas, quickly generating a large number of code edits.
Gemini Pro gives critical depth to suggestions. It applies deeper reasoning and context understanding to come up with larger or more innovative changes.
Each new version of the code is run and scored using the evaluation function provided by the user. AlphaEvolve runs simple checks or complex tests in parallel to speed up the process.
AlphaEvolve stores every attempt, good or bad, in the Program Database and keeps building on the solutions that worked best. This automatic feedback loop helps the AI avoid bad suggestions.
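The loop described in the steps above can be sketched in a few lines. This is a toy under loudly stated assumptions, not DeepMind’s implementation: a "program" is reduced to a pair of numbers, the LLM step is replaced by a random perturbation (`propose_edit`), and the parent selection rule (`sample_parent`, picking among the top five) is invented for illustration. Only the overall shape comes from the text: sample from the database, propose a change, evaluate it, store the result.

```python
import random


def evaluate(params):
    """User-supplied scorer: higher is better. The toy optimum is (3, -1)."""
    x, y = params
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)


def propose_edit(parent, rng):
    """Stand-in for the LLM step: a small random change to the parent.

    In the real system this is where a prompt is built from stored programs
    and sent to the language models.
    """
    return tuple(p + rng.gauss(0.0, 0.5) for p in parent)


def sample_parent(database, rng, top_k=5):
    """Pick a parent among the best-scoring entries stored so far."""
    best = sorted(database, key=lambda entry: entry[0], reverse=True)[:top_k]
    return rng.choice(best)[1]


def evolve(initial, iterations=200, seed=0):
    rng = random.Random(seed)
    # The database keeps every attempt, good or bad, as (score, candidate).
    database = [(evaluate(initial), initial)]
    for _ in range(iterations):
        parent = sample_parent(database, rng)
        child = propose_edit(parent, rng)
        database.append((evaluate(child), child))
    best_entry = max(database, key=lambda entry: entry[0])
    return best_entry, database


(best_score, best_params), database = evolve((0.0, 0.0))
print(best_score, best_params, len(database))
```

Swapping the random perturbation for a model that reads earlier attempts is what turns this blind search into the guided one the article describes. The evaluate-and-store part stays the same.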
Compared to other systems, AlphaEvolve can write large, complex code, work in any programming language, and run evaluations in parallel on powerful computers. It can also optimize for multiple goals at once, like achieving both high speed and accuracy. Where has AlphaEvolve been used already?
Achievements of AlphaEvolve
AlphaEvolve has already shown amazing results in math, computer science, and engineering, especially where results can be tested automatically.
Firstly, it discovered new solutions for math problems and new algorithms:
Matrix multiplication: AlphaEvolve improved on a famous algorithm, Strassen’s from 1969, by finding a way to multiply 4×4 complex-valued matrices with 48 scalar multiplications instead of the 49 that Strassen’s method needs.
It was applied to over 50 open math problems, matching the best-known solutions on most of them and beating them on about 20%, including the Minimum Overlap Problem, the kissing number problem in 11 dimensions, an uncertainty principle problem in Fourier analysis, and bounds in autocorrelation inequalities. All of it is truly mind-blowing.

Image Credit: AlphaEvolve original paper
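For context on the matrix multiplication result, it helps to see where the long-standing baseline comes from. Strassen’s scheme multiplies 2×2 blocks with 7 products instead of 8, so applying it recursively to a 4×4 matrix costs 7 × 7 = 49 scalar multiplications. AlphaEvolve’s reported result for complex-valued matrices is 48. The sketch below reproduces only the 49-multiplication baseline and checks it against the naive product. It does not contain AlphaEvolve’s algorithm, and the helper names (`strassen_2x2`, `split`, `counter`) are my own.

```python
def strassen_2x2(a, b, mul, add, sub):
    """Strassen's 7-product scheme for a 2x2 grid, generic over entry type."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = mul(add(a11, a22), add(b11, b22))
    m2 = mul(add(a21, a22), b11)
    m3 = mul(a11, sub(b12, b22))
    m4 = mul(a22, sub(b21, b11))
    m5 = mul(add(a11, a12), b22)
    m6 = mul(sub(a21, a11), add(b11, b12))
    m7 = mul(sub(a12, a22), add(b21, b22))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    return ((c11, c12), (c21, c22))


counter = {"mults": 0}  # counts scalar (here: complex) multiplications


def scalar_mul(x, y):
    counter["mults"] += 1
    return x * y


def scalar_add(x, y):
    return x + y


def scalar_sub(x, y):
    return x - y


def block_add(p, q):
    return tuple(tuple(x + y for x, y in zip(rp, rq)) for rp, rq in zip(p, q))


def block_sub(p, q):
    return tuple(tuple(x - y for x, y in zip(rp, rq)) for rp, rq in zip(p, q))


def block_mul(p, q):
    # Inner level of the recursion: each block product is itself Strassen.
    return strassen_2x2(p, q, scalar_mul, scalar_add, scalar_sub)


def split(m):
    """Cut a 4x4 matrix into a 2x2 grid of 2x2 blocks."""
    def block(r, c):
        return tuple(tuple(m[r + i][c + j] for j in range(2)) for i in range(2))
    return ((block(0, 0), block(0, 2)), (block(2, 0), block(2, 2)))


def strassen_4x4(a, b):
    c = strassen_2x2(split(a), split(b), block_mul, block_add, block_sub)
    return [[c[i // 2][j // 2][i % 2][j % 2] for j in range(4)] for i in range(4)]


A = [[complex(i + j, i - j) for j in range(4)] for i in range(4)]
B = [[complex(i * j, 1 + j) for j in range(4)] for i in range(4)]
C = strassen_4x4(A, B)
naive = [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
         for i in range(4)]

print(counter["mults"])  # 49
```

Saving a single multiplication sounds tiny, but schemes like this are applied recursively to large matrices, so the saving compounds at every level of the recursion.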
Secondly, AlphaEvolve upgraded engineering at Google (putting its own tools to work on its own ground!):
It created a better scheduling heuristic for Google’s data centers. It fits jobs onto machines based on CPU and memory, like packing items into boxes. This saved 0.7% of total compute resources across data centers. It’s a HUGE win at Google’s scale.
AlphaEvolve helped to speed up training of Gemini models. It discovered a tiling rule (breaking big computations into smaller chunks) that gave a 23% speedup in kernel runtime, which led to a 1% reduction in overall Gemini training time.
AI-assisted chip design: an AlphaEvolve solution is even being applied in an upcoming Google Tensor Processing Unit (TPU). It smartly simplified the hardware design, removing extra bits and designing a faster arithmetic circuit for matrix multiplication.
AlphaEvolve also improved models in general, speeding up transformer models’ FlashAttention. Working on complex code to make it run faster, it tested changes to the attention kernel itself and improved the pre- and post-processing steps. As a result, it achieved up to a 32.5% speedup for the core FlashAttention computation and a 15% speedup in the surrounding code.
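The data center scheduling item above is easier to picture with a toy version of the packing problem. In the sketch below, a scoring function decides which machine each job goes to, and swapping that one function changes how much capacity gets stranded. To be clear about assumptions: the `Machine` class, the `tightness_score` rule, the job sizes, and the units are all hypothetical, and this is not the heuristic AlphaEvolve discovered. It only shows the kind of small, swappable function such a system can evolve.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    free_cpu: float  # hypothetical unit: cores
    free_mem: float  # hypothetical unit: GB


def fits(job, machine):
    cpu, mem = job
    return cpu <= machine.free_cpu and mem <= machine.free_mem


def first_fit(job, machine):
    """Baseline: every feasible machine scores the same, so the first wins."""
    return 0.0


def tightness_score(job, machine):
    """Hypothetical heuristic: prefer the machine left with the least slack."""
    cpu, mem = job
    return -((machine.free_cpu - cpu) + (machine.free_mem - mem))


def place(jobs, machines, score):
    """Assign each job to the feasible machine with the highest score."""
    placement = []
    for job in jobs:
        candidates = [i for i, m in enumerate(machines) if fits(job, m)]
        if not candidates:
            placement.append(None)  # nothing fits: the job stays pending
            continue
        # max() keeps the first candidate when scores tie.
        best = max(candidates, key=lambda i: score(job, machines[i]))
        machines[best].free_cpu -= job[0]
        machines[best].free_mem -= job[1]
        placement.append(best)
    return placement


def fresh_machines():
    return [Machine(8, 8), Machine(4, 4)]


jobs = [(4, 4), (8, 8)]  # (cpu, mem) requests, in arrival order
first_fit_result = place(jobs, fresh_machines(), first_fit)        # [0, None]
best_fit_result = place(jobs, fresh_machines(), tightness_score)   # [1, 0]
print(first_fit_result, best_fit_result)
```

With the baseline, the small job lands on the big machine and the large job has nowhere to go. The tighter rule keeps the big machine free. A fleet-wide fraction of a percent is the sum of many small decisions like this one.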
Tasks that used to take developers months can now be solved in just a few days with AlphaEvolve. This shows that AI can contribute directly to everything from cloud computing to chip design. What is also cool is that this breakthrough delivered real-world performance gains at the scale of Google. However, there are some limitations.
Not without limitations
The main limitation is that AlphaEvolve needs an evaluation function to tell good solutions from bad ones. It works well for math, algorithm design, and engineering optimization, but it can’t be applied to scenarios where results must be interpreted qualitatively, such as biological experiments, social science research, or artistic creativity.
Not optimized for LLM-only feedback: it needs to 'run' an idea (as code) to test it, so it struggles with purely conceptual or abstract ideas that an LLM might judge.
As AlphaEvolve relies on code, it is limited to domains where ideas can be simulated or tested automatically. Fields that depend on physical, real-world experiments are much harder to cover.
Slow feedback loops: Practical infrastructure improvements take a long time to feed back into the system.
Some gains, such as a 1% reduction in Gemini’s training time, can be considered modest.
Despite this, AlphaEvolve marks a breakthrough in coding agents, which can be used to optimize workflows in a wide range of areas. Yes, the individual improvements may not be huge, but at scale they bring tangible benefits to the industry. Of course, this is just the beginning for AlphaEvolve, and DeepMind most likely has a lot more to surprise us with.
Now let’s move on to another AI-powered coding assistant that made waves across social media.
OpenAI’s Codex coding assistant
Codex is a cloud-based software engineering agent from OpenAI that goes beyond just assisting with code and acts more like a collaborator that supports developers. It can perform various coding tasks in parallel:
Write new features
Fix bugs
Clean up code (typos, inconsistencies, etc.)
Answer questions about your code
Suggest pull requests (PRs)
Work with your actual codebase
Be left to work in the background while you focus on other things
You can use many Codex agents in parallel, running in the cloud, so they don’t slow down your laptop. What is especially convenient about Codex is that it works directly through the ChatGPT sidebar (but only if you’re using the Pro, Team, or Enterprise versions; support for Plus users coming soon).
Codex is powered by a special version of OpenAI’s o3 model fine-tuned for software engineering tasks, called codex-1. It was trained using:
End-to-end reinforcement learning on real-world code tasks. It learned by trying, testing, and correcting.
A focus on following developer instructions, writing clean pull requests, and producing fixes that pass real tests.
How does Codex work?
Codex works on your repository inside an environment configured for it, and for that you’ll need to connect your GitHub account to Codex. Here is its working process, step by step:
To assign a task to Codex, click the “Code” button for a coding task, or click “Ask” to ask questions about your codebase.
While performing a task, Codex can read and modify files, run tests, type checkers, and linters, and edit or generate code based on your instructions.

Image Credit: A research preview of Codex in ChatGPT video
Each task runs in its own secure sandbox, loaded with your code, so Codex can test and work safely without breaking anything in production. It works only with code from your GitHub repo and pre-installed tools or libraries you define in a setup script.
When Codex finishes a task, it presents its changes in the cloud environment, providing citations and logs (like test outputs or terminal commands) so you can see exactly what it did. The user can then review, revise, or merge the code, or open a GitHub PR.
This is optional, but a user can add a file called AGENTS.md to the project (it’s like a README for the AI) to help Codex understand how the code works — for example, which commands to run, how the repository is structured, what tests to run, or how to follow the team’s coding standards.
There is also a cool feature that signals Codex is a smarter system than many others. If a test fails or it’s unsure about something, Codex tells you clearly, so you know there might be issues. It won’t silently continue or “fake” a result, which is an important part of its trustworthiness. The Codex agent asks for clarification, pauses when stuck, and gives you visibility into its thinking.
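The "show your evidence and fail loudly" behavior described above can be pictured as a tiny harness: run each check, keep its log, and surface failures instead of hiding them. This is only a sketch of the idea, assuming nothing about how Codex is actually built. The function names `run_check` and `summarize` and the report format are invented for illustration, and the two "tests" are one-line Python commands standing in for a real test suite.

```python
import subprocess
import sys


def run_check(name, command):
    """Run one check and keep its exit code and output as evidence."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "name": name,
        "passed": result.returncode == 0,
        "log": (result.stdout + result.stderr).strip(),
    }


def summarize(reports):
    """Report failures explicitly rather than claiming success."""
    failed = [r["name"] for r in reports if not r["passed"]]
    if not failed:
        return "all checks passed"
    return "NEEDS ATTENTION: " + ", ".join(failed)


reports = [
    run_check("passing test", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    run_check("failing test", [sys.executable, "-c", "assert 1 + 1 == 3"]),
]

print(summarize(reports))
for report in reports:
    print(report["name"], "->", "ok" if report["passed"] else report["log"])
```

Keeping the raw log next to the verdict is the useful part. A reviewer can check why something failed without rerunning anything.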
How good is Codex?
Many companies, such as Cisco, Superhuman, Temporal, and Kodiak Robotics, as well as OpenAI’s own engineers, use Codex for different purposes, which suggests that it is indeed a convenient and powerful coding tool. Here’s what it can do effectively:
Speed up common coding tasks: It can handle medium-complexity tasks that would take human developers 1–30 minutes each.
Scale to handle many tasks in parallel, helping teams move faster.
Help both with day-to-day development and larger codebase understanding.
Debug and refactor large codebases.
Run background tasks so engineers can stay focused on their main tasks.
Make small code changes, reducing the need to loop in engineers, except for reviews.
Developers can just assign Codex a task, walk away, and come back to a finished result that is ready for review.
As for benchmarks tailored specifically to coding, the codex-1 model powering Codex outperforms OpenAI's other top models on software engineering tasks.

Image Credit: Codex blog
Codex can also work with a very large context window of up to 192,000 tokens, processing a lot of code and information at once, which is great for big codebases.
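To get a feel for what 192,000 tokens means in practice, here is a back-of-the-envelope estimate. It rests on two assumptions that are mine, not OpenAI’s: the common rule of thumb of roughly 4 characters per token (real tokenizers vary, especially on code) and an arbitrary reserve of 20,000 tokens for the model’s own output. The file names and sizes are made up.

```python
CONTEXT_WINDOW_TOKENS = 192_000  # figure quoted in the text
CHARS_PER_TOKEN = 4              # rough rule of thumb, an assumption


def estimate_tokens(text):
    """Very rough token estimate using ceiling division on characters."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(files, reserve_for_output=20_000):
    """Return (estimated total tokens, whether they fit in the budget)."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total, total <= CONTEXT_WINDOW_TOKENS - reserve_for_output


# Hypothetical repository: file name -> file contents.
repo = {"a.py": "x" * 400_000, "b.py": "y" * 200_000}
total_small, ok_small = fits_in_context(repo)    # (150000, True)

repo["c.py"] = "z" * 100_000
total_large, ok_large = fits_in_context(repo)    # (175000, False)

print(total_small, ok_small, total_large, ok_large)
```

Under these assumptions the window holds on the order of several hundred thousand characters of source at once, which is why it suits large codebases without guaranteeing that an entire repository fits.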
The overall advantage of Codex is that its output looks and feels like code a human would write. It is cleaner, more readable, and fits better into real-world software development workflows.
Codex CLI: complementing Codex
An extension of the Codex agentic ecosystem is Codex CLI, a lightweight coding agent that runs directly in your terminal (command line) on your own computer. It lets you interact with AI in real time, right in your coding environment, as a form of pair programming.
To power Codex CLI, OpenAI developed a smaller, faster model called codex-mini-latest, which is based on o4-mini. It’s ideal for quick, low-latency tasks, and retains strong instruction-following and code style capabilities.
To work with Codex CLI, you can sign in with your ChatGPT account — no need to manually manage API tokens.
Codex CLI (local) and Codex (cloud) are two sides of the same system. Together, they form a hybrid workflow:
Codex CLI is better for fast, interactive tasks and quick local edits.
Codex (cloud) is better for long-running or large-scale jobs.
They can blend into one experience, where Codex becomes your co-worker, pair programmer, and assistant across every part of your stack (from terminal to cloud infrastructure), available in both real-time and background modes.
On mobile
Codex is now available on mobile devices: OpenAI has integrated it into the ChatGPT iOS app, enabling users to initiate coding tasks, view code differences, request modifications, and even push pull requests directly from their iPhones. The way we code is changing fast. Now you can write, edit, and push code straight from the ChatGPT app on your phone. No laptop, no IDE, just your phone and AI. Soon, starting a project might mean just saying what you need. Developers will need to give clear instructions, understand the logic, and check results quickly. It’s a manager’s job, but one that requires a deep understanding of how things are built.
However, there are some issues that now limit Codex’s potential.
Current limitations
Codex is still a research preview, so some features aren’t available yet:
It can’t use images, so it won’t help you with visual or UI-based frontend tasks.
You can’t guide it in the middle of a task. You can only assign a task and then wait for the result.
It can feel slower than editing code by hand, especially while it runs in the cloud.
Codex is just at its starting point. If these issues are overcome, using Codex will feel more like working with a helpful teammate.
Conclusion
Our coding experience is changing fast. With tools like Codex now running inside the ChatGPT mobile app, developers can start, edit, and push code from a phone, with no laptop or IDE needed. At the same time, AlphaEvolve from DeepMind shows how agents can explore solutions on their own, evolving better code through automated testing and scoring. We're moving from writing code line by line to guiding smart systems that take over much of the routine. AI is becoming a true collaborator, and you are becoming its thoughtful manager. Yes, 2025 is the year of agents. And it is also a year of expanding human-to-agent collaboration.
Sources and further reading
Introducing Codex (blog)
Codex (starting page)
Codex CLI and Sign in with ChatGPT (setup blog)


