What Is Skill Engineering for AI Agents?

TL;DR: Skill engineering is becoming the next optimization layer for AI agents. SkillOpt trains one skill, SkillOps maintains whole skill libraries, and SkillMOO optimizes coding-agent skill bundles for quality and cost. The core lesson: better agents need cleaner, validated, reusable skills.

Today you are probably using (or have at least tried) personal AI agents like OpenClaw and Hermes Agent in your everyday workflow. They have already built a loyal following, including Jensen Huang, and their popularity is only growing. Part of what makes these agents work is reusable skills: instruction packages that define how an agent uses tools, structures workflows, makes decisions, and solves recurring tasks. Skills have become the operational core of agent behavior.

But the requirements are shifting. Improving agent performance used to come down to prompt and context engineering. That's no longer enough — skills now carry an additional layer of context and knowledge, which raises a natural question: how do you optimize the skills themselves?

Today, most skills are written by humans, generated once by an LLM, or refined through trial and error. (Hermes gestures toward something better, improving skills during use, but the broader practice remains ad hoc.) Recently, several methods have emerged to make skill self-improvement systematic: one trains individual skills, one manages the whole skill library, and one optimizes skills specifically for software engineering.

We'll break down exactly how each works, step by step, so you can apply the ideas to your own agentic workflows. Skill engineering is still early, and most people are not paying attention to it yet. We are. Let’s dive in!

In today’s episode:

From prompt engineering to skill engineering
Skill engineering for AI agents: what is it?
SkillOpt: training one reusable agent skill
Maintaining skill libraries with SkillOps
SkillMOO: optimizing skill bundles for software engineering agents
Conclusion: the lessons for skill engineering
Sources and further reading

Prompt Engineering vs Context Engineering vs Skill Engineering

Let’s start with the main concepts. Engineering here means designing the whole operating environment around the agent, and there are three layers we can work with.

Prompt engineering

Prompt engineering is the most familiar one. It means writing a good instruction for a specific request. You tell the model what you want, maybe give it a role, constraints, examples, formatting rules, and success criteria. For example: “Summarize this paper in simple language, focus on the method, and give me five main takeaways.”

The main feature: prompt engineering is usually local and moment-specific, because it helps the model perform one task better right now.

However, a great instruction can’t compensate for missing context. A prompt can be perfectly written, but if the model does not have the right documents, tools, memory, or task state, it will still fail. That brings us to the second layer.

Context engineering

Context engineering is about assembling the right environment around the model at runtime. It includes the system message, tool descriptions, retrieved documents, memory, examples, current task state, previous actions, constraints, permissions, and sometimes even the model’s budget for tokens, time, or tool calls.

In other words, it means building the base context that a model or an agent should know and have access to when it acts or responds. This context is what you need to optimize, because:

Different agents need different context: a coding agent needs the relevant files, tests, issue description, and repo structure, while a research agent needs papers, citations, hypotheses, notes, and experiment results.
Too little context makes the agent guess, while too much context makes it distracted or expensive.
Stale context makes an agent confidently wrong, and noisy retrieved context can push it toward irrelevant decisions.
If last year was the year of prompt engineering, 2025–2026 is the year of context engineering — and its implications for agentic coding systems run deep. See how practitioners at the AI Engineer Summit applied these principles in practice: State of AI Coding: Context, Trust, and Subagents.

And the newest layer is →

What Is Skill Engineering for AI Agents?

Skill engineering means creating reusable capability packages that an agent can discover, apply, improve, version, and transfer across tasks. In simple terms, a skill is a mini-procedure: it tells the agent not only what to do, but how to do it again.

That makes skills feel closer to software artifacts than disposable prompts. A prompt is usually written for one situation. A skill is meant to survive across situations. It can be tested, maintained, shared across agents, and updated as the workflow changes.

This becomes especially important as agents turn into longer-running systems. The more often an agent repeats a workflow, the more valuable it becomes to give that workflow a stable shape. A library of skills can make agent behavior more consistent, more inspectable, and easier to improve over time.
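To make the "skill as a software artifact" idea concrete, here is a minimal sketch of a versioned skill record. The `Skill` class, its fields, and the `revise` helper are our own illustrative assumptions, not a format used by any specific agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """A reusable instruction package. All field names are illustrative."""
    name: str
    version: int
    instructions: str

    def revise(self, new_instructions: str) -> "Skill":
        # Return a new version instead of mutating the old one, so earlier
        # versions stay available for comparison or rollback.
        return Skill(self.name, self.version + 1, new_instructions)


v1 = Skill("spreadsheet-qa", 1, "Inspect column names before answering.")
v2 = v1.revise(v1.instructions + " Verify the aggregation you used.")
```

Keeping each version immutable is one simple way to make a skill testable and comparable over time, which is what separates it from a disposable prompt.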

But there is another side to this. If skills become part of the agent stack, they also become a new surface for errors, misuse, and attacks.

That is why skills need to be validated and maintained, and that is what the methods below address.

SkillOpt: training one reusable agent skill

Among the efforts to improve agents’ skills, one recent piece of research from Microsoft stands out. The researchers proposed SkillOpt, which is “a systematic controllable text-space optimizer for agent skills.” SkillOpt treats the skill document itself as something that can be trained, creating a pipeline of “agents → skills → self-improving workflows”. The core twist is that it keeps the agent model’s weights fixed and improves only the external skill text that the agent uses.

Instead of relying on one-off prompting or unstable skill rewriting, SkillOpt repeatedly tests, edits, validates, and keeps only improvements. The final skill remains readable and reusable, and because the model itself does not change, the method saves resources.

The workflow of SkillOpt

The method works like this:

In this framework, an agent skill is a natural-language policy that is given to the agent before it starts solving a task:
- The skill can be placed in the system or developer instruction if an agent works in a direct chat setting.
- In a tool-use setting, the skill acts more like persistent procedural memory: a reusable set of instructions that tells the agent how to search, reason, use tools, check its work, and format the final answer.

Image Credit: SkillOpt original paper

The agent tries tasks using the current skill, and those attempts are scored. More precisely, for a model M, a task x, a harness h, and a skill s, the agent runs the task and produces two things:
- A trajectory τ, which contains what the agent did during execution: messages, tool calls, observations, commands, and final answers.
- A score r, which measures how well the agent performed on that task.

Image Credit: SkillOpt original paper
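The rollout described above (model M, task x, harness h, skill s in, trajectory τ and score r out) can be sketched as a single function. This is a toy illustration under our own assumptions: `Rollout`, `run_task`, and `toy_model` are hypothetical names, and the exact-match scoring is a stand-in for whatever scorer a real benchmark uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    trajectory: list[str]  # tau: what the agent did during execution
    score: float           # r: how well the agent performed on the task


def run_task(model: Callable[[str], str], task: dict, skill: str) -> Rollout:
    """Hypothetical harness h: runs model M on task x with skill s."""
    prompt = f"{skill}\n\nTask: {task['question']}"
    answer = model(prompt)
    trajectory = [f"prompt: {prompt}", f"answer: {answer}"]
    score = 1.0 if answer.strip() == task["expected"] else 0.0
    return Rollout(trajectory, score)


def toy_model(prompt: str) -> str:
    # Stand-in for a frozen model: it only succeeds when the skill helps.
    return "4" if "add the numbers" in prompt else "unknown"


task = {"question": "2 + 2", "expected": "4"}
with_skill = run_task(toy_model, task, "Always add the numbers.")
without_skill = run_task(toy_model, task, "")
```

The model is identical in both calls. Only the skill text changes, and that is the only thing the score difference can be attributed to.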

The tasks come from a dataset with three splits:
- Dtr: training tasks, used to collect experience and propose skill edits
- Dsel: selection or validation tasks, used to decide whether an edited skill is actually better
- Dtest: test tasks, used only at the end to report final performance
The training split Dtr is used to collect evidence about how the agent behaves with the current skill.
The agent runs a batch of tasks and produces rollout trajectories containing both successes and failures. Batch size is a tradeoff:
- Small batches are faster but noisier.
- Larger batches are costlier but expose more recurring patterns.
SkillOpt may discover that the agent searches the wrong source, skips verification, uses the wrong format, misses key fields, or reaches the right conclusion but reports it incorrectly.
These rollouts become evidence for improving the skill. SkillOpt can also accumulate multiple batches before updating a skill, giving the optimizer stronger evidence for making changes.
Image Credit: SkillOpt original paper
Then an optimizer analyzes the trajectories and generates candidate skills from the Dtr training split.
It splits them into successes and failures and reflects on them in smaller minibatches. This helps it find reusable procedural lessons. For instance, instead of learning “check column C in this spreadsheet,” it can learn a broader rule like “inspect column names and verify the relevant row or aggregation before answering spreadsheet questions.”
The optimizer then suggests small, controlled edits to the skill document: adding, deleting, or replacing instructions in the text.
An edit is accepted only if it improves performance on a Dsel validation set:
- If the candidate improves the validation score, it becomes the new current skill.
- If it also has the best score so far, it can become the exported best_skill.md.
- If it doesn’t improve the score, it is rejected. SkillOpt stores rejected edits in a rejected-edit buffer, so the optimizer can avoid repeating the same mistakes later.
The final selected skill is then evaluated on the Dtest test split.
At deployment, the final result is only the best accepted skill. It is exported as a compact skill file, usually best_skill.md, that doesn’t require any extra optimizer calls during inference. The deployed model itself remains unchanged.

In short, SkillOpt turns skill writing into a controlled optimization loop: run the agent, collect evidence, reflect on successes and failures, propose bounded text edits, validate them on held-out tasks, reject harmful changes, and export the best skill as a reusable artifact.
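Here is a minimal sketch of that accept-or-reject loop. It is heavily simplified and rests on our own assumptions: a skill is a list of rules, the only edit type is "append one rule" (the method described above also deletes and replaces text), and `toy_propose` and `toy_score` are hypothetical stand-ins for the LLM optimizer and the validation run on the selection split.

```python
from typing import Callable, Optional


def optimize_skill(
    skill: list[str],
    propose_rule: Callable[[list[str], list[str]], Optional[str]],
    score_on_selection: Callable[[list[str]], float],
    steps: int,
) -> tuple[list[str], list[str]]:
    """Keep an edit only if it improves the score on held-out selection tasks."""
    current = list(skill)
    current_score = score_on_selection(current)
    rejected: list[str] = []  # rejected-edit buffer, shown to the proposer

    for _ in range(steps):
        rule = propose_rule(current, rejected)
        if rule is None:  # the proposer has nothing new to suggest
            break
        candidate = current + [rule]
        candidate_score = score_on_selection(candidate)
        if candidate_score > current_score:
            current, current_score = candidate, candidate_score
        else:
            rejected.append(rule)
    return current, rejected


# Toy stand-ins (assumptions): a fixed pool of rules and a made-up scorer.
POOL = ["inspect column names", "answer quickly", "verify the result"]
USEFUL = {"inspect column names", "verify the result"}


def toy_propose(current: list[str], rejected: list[str]) -> Optional[str]:
    for rule in POOL:
        if rule not in current and rule not in rejected:
            return rule
    return None


def toy_score(skill: list[str]) -> float:
    return sum(1.0 if rule in USEFUL else -1.0 for rule in skill)


best_skill, rejected_edits = optimize_skill([], toy_propose, toy_score, steps=10)
print(best_skill)      # ['inspect column names', 'verify the result']
print(rejected_edits)  # ['answer quickly']
```

The design choice worth noticing is that the proposer never decides on its own. Every edit has to earn its place through the held-out score, and rejected edits are remembered instead of being proposed again.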

And one more important detail: SkillOpt also keeps track of what happens over longer training epochs. At the end of an epoch, it compares the old skill with the new one on the same tasks and checks what got better, what got worse, what still fails, and what keeps working. This helps the optimizer keep useful long-term lessons instead of constantly overreacting only to the latest batch.
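The end-of-epoch comparison can be pictured as a four-way split of the tasks. This is a sketch under our own assumptions: `compare_epoch` is a hypothetical helper and per-task results are reduced to pass or fail.

```python
def compare_epoch(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Compare the old and the new skill on the same tasks (True = solved)."""
    report: dict[str, list[str]] = {
        "improved": [], "regressed": [], "still_failing": [], "still_working": [],
    }
    for task_id in sorted(old):
        before, after = old[task_id], new[task_id]
        if after and not before:
            report["improved"].append(task_id)
        elif before and not after:
            report["regressed"].append(task_id)
        elif not before:
            report["still_failing"].append(task_id)
        else:
            report["still_working"].append(task_id)
    return report


old_results = {"t1": False, "t2": True, "t3": False, "t4": True}
new_results = {"t1": True, "t2": False, "t3": False, "t4": True}
epoch_report = compare_epoch(old_results, new_results)
```

A report like this separates real progress (`improved`) from regressions that a single aggregate score would hide.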

To keep training stable, SkillOpt combines three safeguards: it limits how much the skill can change at once, tracks rejected edits so the optimizer can avoid similar mistakes later, and applies slower updates across training epochs to preserve useful patterns.

Performance results

Microsoft researchers evaluated SkillOpt across six benchmarks, seven target models, and three execution settings: direct chat, Codex, and Claude Code. SkillOpt was best or tied for best in all 52 tested settings:

On GPT-5.5, it improves average accuracy over using no skill by 23.5 points in direct chat, 24.8 points in the Codex agent loop, and 19.1 points in Claude Code.
The results also show that the optimized skills can transfer. A skill trained in one setting can still help other model sizes, other execution environments, or related benchmarks without further optimization. For example:
- In cross-model transfer, a GPT-5.4 skill improves GPT-5.4-mini on SpreadsheetBench: 36.1 → 45.5 (+9.4 points).
- In cross-harness transfer, a Codex-trained SpreadsheetBench skill transfers to Claude Code: 22.1 → 81.8 (+59.7 points).
- In cross-benchmark transfer, an OlympiadBench skill transfers to Omni-MATH on GPT-5.4: 56.6 → 60.3 (+3.7 points).
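As a quick arithmetic check, the gains follow directly from the before and after scores quoted above. The dictionary keys are just our own labels.

```python
# (before, after) accuracy pairs as quoted in the transfer results above.
transfers = {
    "cross-model": (36.1, 45.5),
    "cross-harness": (22.1, 81.8),
    "cross-benchmark": (56.6, 60.3),
}

# Gain in percentage points, rounded to one decimal place.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in transfers.items()
}
print(gains)  # {'cross-model': 9.4, 'cross-harness': 59.7, 'cross-benchmark': 3.7}
```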

Advantages vs. Limitations

To sum up, SkillOpt offers real gains for skill self-improvement: it trains the instruction file around a frozen agent, keeps only validated improvements, produces compact reusable skills, and improves performance without fine-tuning the model.

It is not without limitations, though. When using this method, keep in mind that:

SkillOpt requires scored tasks.
The optimizer needs many rollout and reflection calls before deployment, which is costly. The quality of the resulting skill also depends on the quality of the optimizer.
Perhaps most importantly, SkillOpt currently has a single-domain focus: it optimizes one skill for one target domain, not a broad skill library.

But there is another method that works specifically with skill libraries →


Maintaining skill libraries with SkillOps

Researchers from Emory University and the University of Illinois Urbana-Champaign built a framework that treats agent skills as a whole library that has to be maintained over time. It is called SkillOps (don’t confuse it with SkillOpt!), and it focuses on one problem: even when each skill looks fine on its own, the library can slowly become messy as agents accumulate more skills, which makes it harder to retrieve from, compose, and trust.

So while SkillOpt focuses on training one skill document, SkillOps helps when you need to manage a whole skill ecosystem.

Before an agent uses a skill library, SkillOps checks the library, diagnoses problems, repairs them, and returns a cleaner version of the library. The downstream agent doesn’t need to change its internal code. It can just use the maintained library instead of the raw one. Interestingly, SkillOps represents the library as a Hierarchical Skill Ecosystem Graph (HSEG). Let’s look at this pipeline in more detail.

How does SkillOps form the Hierarchical Skill Ecosystem Graph?

The Hierarchical Skill Ecosystem Graph, or HSEG, has two layers:

First, there is an Internal Skill Graph, where SkillOps represents each skill as a structured contract with five parts:

P – preconditions: when the skill can be used
O – operation: what the skill does
A – artifact: what the skill produces
V – validator: how to check whether the output is correct
F – failure modes: known ways the skill can fail

This matters because the contract lets SkillOps check whether a skill is actually usable and compatible with other skills.
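A five-part contract like this maps naturally onto a small data structure. The sketch below is our own illustration: the `SkillContract` class, its field types, and the example skill are assumptions, not the schema used by SkillOps.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SkillContract:
    name: str
    preconditions: set[str]                            # P: when the skill can be used
    operation: str                                     # O: what the skill does
    artifact: str                                      # A: what the skill produces
    validator: Optional[Callable[[str], bool]] = None  # V: how to check the output
    failure_modes: list[str] = field(default_factory=list)  # F: known failures

    def can_run(self, state: set[str]) -> bool:
        # Usable only if every precondition holds in the current state.
        return self.preconditions <= state

    def has_validator(self) -> bool:
        return self.validator is not None


find_object = SkillContract(
    name="find_object",
    preconditions={"object_named"},
    operation="search the room for the named object",
    artifact="object_location",
    validator=lambda output: output != "",
    failure_modes=["object not visible"],
)
```

With contracts in this shape, "is this skill usable right now?" becomes a set comparison, and "does it lack a validator?" becomes a simple field check.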

Then SkillOps organizes an External Graph-of-Graphs where skills are connected by different types of relations:

Dependency edges (dep), when one skill produces something another skill needs.
Compatibility edges (comp), where the output of one skill fits the input of another.
Redundancy edges (red), where two skills do the same thing.
Alternative edges (alt), if two skills solve the same goal in different ways.

Image Credit: The original paper
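The four relation types can be sketched as a graph with typed edges. Everything here is an assumption made for illustration: the `SkillGraph` class, the convention that a `dep` edge points from the skill that needs something to the skill that produces it, and the choice to treat `red` and `alt` as symmetric.

```python
from collections import defaultdict

EDGE_TYPES = {"dep", "comp", "red", "alt"}
SYMMETRIC = {"red", "alt"}  # assumption: these relations hold in both directions


class SkillGraph:
    def __init__(self) -> None:
        # (source skill, edge type) -> set of target skills
        self._edges: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self._edges[(source, edge_type)].add(target)
        if edge_type in SYMMETRIC:
            self._edges[(target, edge_type)].add(source)

    def neighbors(self, source: str, edge_type: str) -> list[str]:
        return sorted(self._edges.get((source, edge_type), set()))


graph = SkillGraph()
graph.add_edge("heat_object", "dep", "find_object")   # heating needs a found object
graph.add_edge("find_object", "comp", "heat_object")  # output fits the next input
graph.add_edge("heat_in_microwave", "alt", "heat_on_stove")
```

Typed edges let later steps ask targeted questions, such as "what can replace this skill?" via `neighbors(name, "alt")`.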

The main workflow of SkillOps consists of two alternating loops: the Task-Time Loop and the Library-Time Loop.

The Task-Time Loop is for agent execution.

When a new task comes in, SkillOps retrieves candidate skill subgraphs from the library. Then it checks if these skills can work together: whether their preconditions are satisfied, whether one skill’s output can be used by the next skill, and whether their interfaces are compatible.

If something is missing, SkillOps can insert extra nodes into the plan. For example, it can add a validator to check if an intermediate output is correct, or an adapter to convert one skill’s output into the format required by another skill.

After that, the assembled skill subgraph is executed. If a skill fails during execution, SkillOps tries local repair: it can substitute the failed skill with an alternative one, repair the skill using the observed error trace, or record the failure if recovery is not possible.
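Two of the task-time steps, inserting adapters between incompatible skills and falling back to an alternative on failure, can be sketched as follows. The function names, the tuple layout for skills, and the format labels are all hypothetical and chosen only for this example.

```python
from typing import Callable


def assemble_plan(
    skills: list[tuple[str, str, str]],    # (name, input format, output format)
    adapters: dict[tuple[str, str], str],  # (from format, to format) -> adapter name
) -> list[str]:
    """Chain skills in order, inserting an adapter where formats do not match."""
    plan: list[str] = []
    for index, (name, input_format, _output) in enumerate(skills):
        if index > 0:
            previous_output = skills[index - 1][2]
            if previous_output != input_format:
                key = (previous_output, input_format)
                if key not in adapters:
                    raise ValueError(f"no adapter from {key[0]} to {key[1]}")
                plan.append(adapters[key])
        plan.append(name)
    return plan


def run_with_fallback(
    name: str,
    alternatives: dict[str, list[str]],
    execute: Callable[[str], str],
) -> tuple[object, list[str]]:
    """Try a skill, then its alternatives; return the result and recorded failures."""
    failures: list[str] = []
    for candidate in [name, *alternatives.get(name, [])]:
        try:
            return execute(candidate), failures
        except RuntimeError as error:
            failures.append(f"{candidate}: {error}")
    return None, failures  # recovery was not possible, only the failures remain


plan = assemble_plan(
    [("extract_table", "pdf", "csv"), ("summarize_rows", "json", "text")],
    {("csv", "json"): "csv_to_json"},
)


def flaky_execute(skill_name: str) -> str:
    if skill_name == "summarize_rows":
        raise RuntimeError("timeout")
    return f"{skill_name} ok"


result, failures = run_with_fallback(
    "summarize_rows", {"summarize_rows": ["summarize_rows_v2"]}, flaky_execute
)
```

Here the plan becomes `extract_table`, `csv_to_json`, `summarize_rows`, and the failed skill is swapped for its alternative while the original failure is still logged.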

The Library-Time Loop focuses on the health of the whole skill library.

It works after execution and mines skill contracts from execution logs, checking the library across five dimensions:

Utility – Is a skill useful in recent tasks?
Redundancy – Are there duplicate or near-duplicate skills?
Compatibility – Can connected skills safely work together?
Failure risk – Does a skill often break?
Validation gap – Does a skill lack a validator?

Based on this diagnosis, SkillOps applies maintenance actions. It can merge redundant skills, repair risky ones, retire low-utility skills, add validators, add adapters, or instantiate parameterized skills for specific task arguments.
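A rule-based diagnosis of this kind can be sketched with a few thresholds over observable statistics. This is only an illustration: `SkillStats`, `diagnose`, the threshold values, and the priority order of the rules are our assumptions, and the sketch covers utility, redundancy, failure risk, and validation gap but not compatibility.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillStats:
    name: str
    recent_uses: int
    failures: int
    has_validator: bool
    duplicate_of: Optional[str] = None  # set when a near-duplicate was detected


def diagnose(
    stats: SkillStats,
    min_uses: int = 1,              # hypothetical utility threshold
    max_failure_rate: float = 0.3,  # hypothetical failure-risk threshold
) -> list[str]:
    """Map observable signals to maintenance actions without any LLM call."""
    if stats.duplicate_of is not None:
        return [f"merge into {stats.duplicate_of}"]
    if stats.recent_uses < min_uses:
        return ["retire"]
    actions: list[str] = []
    if stats.failures / stats.recent_uses > max_failure_rate:
        actions.append("repair")
    if not stats.has_validator:
        actions.append("add validator")
    return actions


library = [
    SkillStats("open_drawer", recent_uses=20, failures=1, has_validator=True),
    SkillStats("open_drawer_v2", 0, 0, False, duplicate_of="open_drawer"),
    SkillStats("fold_towel", recent_uses=0, failures=0, has_validator=True),
    SkillStats("heat_object", recent_uses=10, failures=5, has_validator=False),
]
maintenance_plan = {stats.name: diagnose(stats) for stats in library}
```

Because every rule reads a counter or a flag, a pass like this costs no tokens, which matches the "mostly rule-based" spirit of the maintenance loop.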

So the full workflow is closer to a self-maintaining ecosystem: the Task-Time Loop uses the current library to solve the task, while the Library-Time Loop learns from execution logs and keeps the library cleaner, safer, and easier to reuse. A nice detail: the current implementation is mostly rule-based and relies on observable signals, so the maintenance loop uses almost no extra LLM calls or tokens at library time.

Performance gains

So how does this method affect agent quality?

In ALFWorld (a benchmark with multi-step household manipulation tasks), SkillOps as a standalone agent achieves 79.5% task success with a 200-skill library, outperforming the baseline by 8.8 percentage points without task-time LLM calls.
SkillOps also works as a plug-in maintenance layer, where it improves retrieval-heavy baselines:
- Hybrid Retrieval improved from 38.2% to 41.1% (+2.90 points)
- BM25 Only improved from 41.8% to 42.8% (+1.00 point)
- Dense Only improved from 32.3% to 33.4%
- SkillWeaver improved from 41.3% to 43.8% (+2.46 points)

The method also stayed stable as the library grew from 200 to 2000 skills. At the largest scale, SkillOps reached 80.5% task success, while the next-best baseline was still more than 31 points behind. But…

What about limitations and benefits?

SkillOps relies on structured skill contracts, which may not always exist in real deployments. Broader tests on real long-running agent logs are also needed. And one more thing: the rule-based maintenance loop is cheap, but it may miss deeper semantic conflicts that require stronger reasoning.

Still, SkillOps’ strongest advantage is that it treats a skill library like a managed software system. It helps keep the library cleaner, safer to compose, and easier for retrieval agents to use. What is also useful: SkillOps can be added as a plug-in layer, so existing agents can use the maintained library without changing their internal logic.

And finally, let’s look at a slightly earlier, but no less interesting, approach.

SkillMOO: Optimizing skill bundles for software engineering agents

Another related direction is SkillMOO, developed by researchers from the United Kingdom, Germany, and China. It is a framework focused specifically on software engineering agents. Its main idea is that agent skills should not be optimized only for pass rate. A skill can improve task success while also making the agent more expensive by adding extra context, tokens, and runtime.

SkillMOO treats skills as a multi-objective optimization problem. The goal is to find the skill bundle that gives the best tradeoff between pass rate and inference cost.

The workflow of SkillMOO

SkillMOO works with skill bundles: task-specific sets of skills that a coding agent uses.

The method uses two agents:

A task solver agent, which tries to solve the coding task using a candidate skill bundle.
A skill optimizer agent, which looks at the results and proposes changes to the bundle.

Image Credit: SkillMOO original paper

The workflow starts with an initial population of candidate bundles. These are usually sampled as diverse subsets of the available skill pool. The task solver runs each candidate on the software task and records several signals:

pass rate – whether the produced solution passes the verifier tests
cost – how much inference costs with this bundle
runtime – how long the run takes
failure traces – what went wrong when the agent failed

Then the skill optimizer takes this evidence and proposes edits to the bundle, such as removing distracting or unnecessary skills, replacing a misaligned skill with a better one, adding missing task-specific guidance, reordering skills to place the most useful ones first, or rewriting outdated skill content.
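Most of these edits are simple operations on an ordered list of skills. The sketch below is our own illustration with hypothetical function and skill names. It covers removing, replacing, adding, and reordering, but not rewriting skill content, which would need an LLM.

```python
def remove_skill(bundle: list[str], name: str) -> list[str]:
    return [skill for skill in bundle if skill != name]


def replace_skill(bundle: list[str], old: str, new: str) -> list[str]:
    return [new if skill == old else skill for skill in bundle]


def add_skill(bundle: list[str], name: str) -> list[str]:
    # Avoid duplicates: adding an existing skill leaves the bundle unchanged.
    return list(bundle) if name in bundle else bundle + [name]


def move_to_front(bundle: list[str], name: str) -> list[str]:
    if name not in bundle:
        return list(bundle)
    return [name] + [skill for skill in bundle if skill != name]


bundle = ["git-basics", "docker-tips", "pytest-guide"]
edited = remove_skill(bundle, "docker-tips")               # drop a distracting skill
edited = replace_skill(edited, "git-basics", "git-bisect")  # swap a misaligned skill
edited = move_to_front(edited, "pytest-guide")             # most useful skill first
```

Each function returns a new list, so every edited bundle can be evaluated as a separate candidate while the original stays intact.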

Each edited bundle becomes a new candidate and is evaluated again by the same task solver.

The important part is the selection step. SkillMOO uses NSGA-II, a classic multi-objective evolutionary algorithm, to select the best candidates against two objectives at once: maximizing pass rate and minimizing inference cost. More precisely, SkillMOO searches for Pareto-efficient skill bundles: bundles where you can’t improve pass rate without increasing cost, or reduce cost without hurting pass rate.
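To show what "Pareto-efficient" means in practice, here is a minimal non-dominated filter over (pass rate, cost) pairs. It is not NSGA-II itself, which adds non-dominated sorting into ranks, crowding distance, and evolutionary variation on top of this idea. The bundle names and numbers are made up for the example.

```python
def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if a is at least as good as b on both objectives and better on one.

    Each tuple is (pass_rate, cost): higher pass rate and lower cost are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(bundles: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of bundles that no other bundle dominates."""
    return sorted(
        name
        for name, objectives in bundles.items()
        if not any(dominates(other, objectives) for other in bundles.values())
    )


# Made-up candidates: (pass rate, relative inference cost).
candidates = {
    "all_skills": (0.60, 1.00),
    "trimmed": (0.60, 0.70),     # same pass rate as all_skills, but cheaper
    "minimal": (0.40, 0.30),     # lower pass rate, but the cheapest option
    "misaligned": (0.35, 0.80),  # worse than trimmed on both objectives
}
front = pareto_front(candidates)
print(front)  # ['minimal', 'trimmed']
```

Both surviving bundles are valid answers. Which one to deploy depends on how much pass rate you are willing to trade for cost.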

In the end, it all comes down to finding which skill bundle is worth deploying.

The key results, advantages and limitations

SkillMOO was tested on all 16 software engineering tasks from SkillsBench and achieved the top pass-rate rank on 11 of 12 non-zero-pass tasks. Most importantly, SkillMOO improves pass rate by up to 21 percentage points (a relative improvement of up to 131%) while reducing cost by up to 31.7%.

Its key takeaway is that better skill bundles are often not longer but more minimal, focused, and cost-aware. The effective edits included removing peripheral or misaligned skills, replacing lower-priority skills, and removing a redundant skill. Surprisingly, adding new skills did not improve pass rate in the filtered archive.

This again points at the main advantage of SkillMOO: it makes skill optimization cost-aware. Treating skills as deployment objects helps to avoid adding unnecessary extra skills.

However, since the evaluation is based on 16 software engineering tasks from one benchmark, broader validation is still needed. The experiments also used only one main model, GLM-5, so it is not yet clear how well the results transfer to GPT, Claude, or other coding agents.

Finally, the edit-pattern analysis is descriptive, not causal: it shows which edits were associated with better results, but does not prove that those edit types always cause improvement.

Conclusion: The lessons for Skill Engineering

What can we take from all these methods? Here is when to use each one:

Use SkillOpt when you want to train and improve one specific skill for one clear domain.
Use SkillOps when you need to manage a whole skill ecosystem. You’ll get a skill library that is treated and maintained like software infrastructure.
Use SkillMOO when you work with software engineering agents and care about the best tradeoff between quality and cost.

The bigger lesson is that skills are becoming an engineering layer around agents: a trainable, maintainable, and optimizable part of the whole architecture. Today’s main tips are:

A skill can be trained as an external artifact, and you don’t need to change model weights to do it.
Skill libraries need maintenance, because useful skills can still create technical debt when they are duplicated, outdated, incompatible, or poorly validated.
More skills are not always better. Sometimes the strongest skill set is smaller, cleaner, and more focused.
Skill engineering is about both adding guidance and removing noise.

It is a new direction that has just started to form its optimization stack. And we love it.


FAQ

What is skill engineering?

Skill engineering is the process of designing, testing, optimizing, and maintaining reusable skills that guide how AI agents solve recurring tasks.

Skill engineering vs prompt engineering: what is the difference?

Prompt engineering improves a single request. Skill engineering creates reusable capability packages that agents can apply across many tasks.

When should you use SkillOpt?

Use SkillOpt when you want to improve one specific skill for a clear task domain using scored rollouts and validation.

SkillOpt vs SkillOps: what is the difference?

SkillOpt optimizes one skill document. SkillOps manages an entire skill library and reduces skill technical debt.

Why does SkillMOO matter?

SkillMOO shows that better agent skills are not always longer. For coding agents, smaller and more focused skill bundles can improve pass rate while reducing cost.
