
Quick answer: What is the “Beyond RL” fine-tuning stack for LLMs?

“Beyond RL” means post-training is no longer a single reinforcement learning stage. Modern LLM tuning combines multiple methods: instruction tuning (SFT), preference alignment (RLHF/DPO/RLVR), adapter-based parameter updates (LoRA family), and alternative optimizers like Evolution Strategies (ES). The shift is from one monolithic loop to a modular stack that is cheaper, more stable, and easier to adapt per task.

Core terms in this article (fast glossary):

  • What is SFT? Supervised Fine-Tuning on prompt-response pairs to teach assistant behavior.

  • What is RLHF? Reinforcement Learning from Human Feedback using preference data and reward optimization.

  • What is RLVR? RL with verifiable rewards (tests, execution checks, exact answers).

  • What is DPO? Direct Preference Optimization that learns preferred > rejected outputs without a full RL loop.

  • What is LoRA? Low-Rank Adaptation: freeze base weights, train small low-rank updates (W' = W + BA).

  • What is Doc-to-LoRA? Generate a LoRA adapter directly from a document (context -> parameters).

  • What is Text-to-LoRA? Generate a LoRA adapter directly from a task description (text -> parameters).

  • What is LoRA-Squeeze? Post/in-training compression of LoRA rank to reduce size with minimal quality loss.

  • What is Kron-LoRA? Structured Kronecker + low-rank adapters for stronger parameter efficiency.

  • What is MoA? Mixture of Adapters: route tokens across heterogeneous adapter types.

  • What is ES? Evolution Strategies: gradient-free optimization via parameter perturbation + reward evaluation.

  • What is LoRA + ES? Use LoRA as a compact search space and ES as the optimizer for scalable post-training.

Subscribe for weekly operator-grade AI systems analysis:
https://www.turingpost.com/subscribe

But first our travel plans: NVIDIA GTC in San Jose

Join us and other top developers, researchers, and business leaders at NVIDIA GTC to explore the next wave of AI innovation. A few of the sessions we’ll be joining:

  • Nemotron Unpacked: Build, Fine-Tune, and Deploy NVIDIA’s Open Models with Bryan Catanzaro on March 17 at 9 am PT

  • How We Scaled Kimi K2.5 [S81695] with Zhilin Yang, founder of Kimi, on March 17 at 11 am PT

  • and Open Models: Where We Are and Where We’re Headed, with a phenomenal lineup including Jensen Huang, Harrison Chase, Michael Truell, and Arthur Mensch, on March 18 at 12:30 pm PT

and don’t forget to say hi if you are at GTC in person!

Today, training is very, very RL-focused. The range of where reinforcement learning is used is enormous: alignment training optimizing models to match human preferences, AI evaluators, or verifiable outcomes; multi-step reasoning capability training; agent training, teaching models to plan actions, use tools, and operate in multi-step environments (coding agents, GUI agents); and even pre-training and system-level learning.

The motivation for using RL is largely practical, inspired by the success of many reasoning models. It allows us to optimize objectives that are hard to express as standard training losses, by simply training models to maximize a defined reward signal.

But researchers are increasingly questioning whether reinforcement learning should remain the dominant post-training stage. Why?

  • Firstly, RL pipelines are expensive and unstable: they require careful tuning of reward models and policy updates, and often provide sparse feedback, especially for long reasoning tasks. Reward hacking is also a pain in the neck.

  • Just remember that Andrej Karpathy once said that “RL is terrible” because a single final reward spreads noisy credit across many irrelevant steps rather than identifying which parts of the reasoning actually matter.

So where exactly are the parts of a model that should be modified? How can we make post-training more efficient? The search for better training approaches is not over. RL’s limitations motivate exploring alternative post-training strategies in optimization and parameterization spaces, including parameter-efficient methods like LoRA (Low-Rank Adaptation) and gradient-free optimization approaches such as evolutionary methods.

Today we’ll discuss what can make post-training cheaper, more stable and dynamic, and more modular than RL-based pipelines. There is a lot to learn about the new LoRAs – Doc-to-LoRA, Text-to-LoRA, LoRA-Squeeze, Kron-LoRA and MoA (Mixture of Adapters) – plus a very interesting direction, Evolution Strategies, and how they can all fit together. Let’s unpack what lies beyond RL.

In today’s episode:

  • Why RL remains powerful but can be costly and brittle in post-training

  • How Doc-to-LoRA and Text-to-LoRA change adapters from “trained artifacts” to “generated modules”

  • How LoRA-Squeeze, Kron-LoRA, and MoA improve adapter efficiency and composition

  • Why Evolution Strategies (ES) are re-emerging as a serious gradient-free optimization path

  • How a LoRA + ES hybrid can make post-training cheaper, more modular, and more scalable

  • Conclusion

  • Sources and further reading

Essentials of Model Training

Before we move on to exploring new ideas for fine-tuning, we need to clarify what training pipelines we use today, and where we can place new alternatives.

When we talk about LLM training, we mean a lifecycle divided into two broad phases: pre-training and post-training. And the distinction is not about when something happens chronologically, but about what objective is being optimized and what capabilities are being added to the model.

  • Pre-training builds the model’s general intelligence substrate. The objective is to predict the next token in a sequence correctly. The result is a base model that has learned the general structure of language and knowledge from massive datasets. Most of the computational cost lies here.

  • Post-training shapes that substrate into a usable system. It teaches the model how to behave, and it is more flexible and varied, spanning several different stages. Despite this, it is the cheaper part of training.

In general, a modern LLM pipeline can be seen as the following combination: pre-training → supervised fine-tuning (SFT) → alignment training → post-training optimization.

In SFT, the model is trained on prompt–response examples to learn how to follow instructions and generate structured answers. This is about how to behave as an assistant.

The alignment training stage attempts to align model outputs with human preferences or task-specific objectives. The most common approach has been RLHF (Reinforcement Learning from Human Feedback): humans rank model outputs, a reward model learns those preferences and scores the model’s outputs, and then RL updates the model to maximize reward using gradient-based optimization.

More and more systems now replace human feedback with automated evaluation signals, such as checking math solutions, executing code, validating database queries, or running programmatic tests, leading to RLVR (Reinforcement Learning with Verifiable Rewards).

On top of this, AI models often include additional improvements: retrieval augmentation (RAG), reasoning training to produce multi-step reasoning traces, tool use and agent behavior training, long-context optimization, safety and calibration training and more.

And today, we are witnessing Low-Rank Adaptation (LoRA) methods evolve into something new. The latest approaches like Doc-to-LoRA, Text-to-LoRA, LoRA-Squeeze and others suggest that post-training is no longer just about running RL on top of a base model. It is becoming a modular stack, where some behaviors are optimized through rewards, while others are injected instantly through generated adapters. Fine-tuning starts to look less like one monolithic training stage and more like a dynamic system for composing capabilities.

Basically, LoRA changes how model parameters are updated during post-training. Early LoRA approaches treated adapters purely as fine-tuning mechanisms. They were trained on datasets to specialize a model for a task.

In standard LoRA fine-tuning, the base model is frozen and only small low-rank matrices are trained: W' = W + BA. Here, W is the original weight matrix and BA is a low-rank update whose rank is usually between 4 and 64. So instead of modifying billions of parameters, the training procedure modifies a few million.
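To make the mechanics concrete, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer in PyTorch (toy sizes; not the peft library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = x W^T + scale * x (BA)^T, with W frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                         # freeze the original W
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)   # (r, d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))         # (d_out, r), zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base path plus the trainable low-rank update BA.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
out = layer(torch.randn(2, 1024))   # only A and B receive gradients during fine-tuning
```

Because only A and B are trainable, the adapter can be stored and shipped as a tiny file and attached to the frozen base model at load time.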

In practice, LoRA is often used for domain specialization, instruction tuning, reasoning tasks and alignment experiments. So a typical workflow pipeline would be: Pretraining → Instruction fine-tuning → Alignment / capability training → Parameter-efficient fine-tuning (LoRA).

Now let’s break down how the latest LoRA methods work and what new angles of post-training they reveal.

Doc-to-LoRA and Text-to-LoRA: Changing what LoRA represents

Recent works from Sakana AI on Doc-to-LoRA and Text-to-LoRA significantly expand the idea of generated adapters. They work similarly but in different directions:

Doc-to-LoRA focuses on memory, while Text-to-LoRA targets faster adaptation to new tasks. Instead of training adapters through gradient updates, a system generates them directly from text inputs: a task description or a document is converted into LoRA weights that modify the base model. This changes the role of LoRA, turning adapters into building blocks of a more modular and dynamic fine-tuning stack, where models can acquire new knowledge, skills, and behaviors on demand.

With Doc-to-LoRA (D2L) the document becomes a parameter update. It turns a document into a LoRA adapter in one forward pass. It uses a hypernetwork that has been meta-trained to approximate context distillation.

[Just a quick reminder, context distillation is a method where a model reads information in context once and then learns to answer related questions later without the original document, because the knowledge has been transferred into its parameters.]

So you give the hypernetwork a new context, and it generates a context-specific LoRA adapter for a frozen base LLM. After that, the base model can answer follow-up questions without re-reading the original document.

Image Credit: Doc-to-LoRA original paper

Over time, the hypernetwork learns a general mapping from context to LoRA adapter. This effectively turns documents into reusable parameter modules that can be stored, shared, and attached to different models.

Architecturally, D2L is based on a Perceiver-style hypernetwork.

Image Credit: Turing Post

It takes hidden activations from the frozen LLM and maps variable-length contexts into a fixed-shape adapter. This is important, because documents can be short or very long, and for longer contexts, D2L uses chunking.
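To make the shape of that mapping concrete, here is a purely conceptual PyTorch sketch (toy dimensions and invented module names, not Sakana AI’s actual architecture): a fixed set of learned latents cross-attends to the document’s hidden states, and small heads emit fixed-shape LoRA factors for the frozen model’s layers.

```python
import torch
import torch.nn as nn

class DocToLoRAHypernet(nn.Module):
    """Conceptual sketch: pool a variable-length document into fixed-shape LoRA factors."""
    def __init__(self, d_model=256, rank=4, n_latents=32, n_target_layers=4):
        super().__init__()
        # Perceiver-style trick: learned latent vectors cross-attend to the document's hidden
        # states, so any context length collapses into n_latents vectors.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Heads that emit the low-rank factors for every adapted layer of the frozen LLM.
        self.to_A = nn.Linear(d_model, n_target_layers * rank * d_model)
        self.to_B = nn.Linear(d_model, n_target_layers * d_model * rank)
        self.rank, self.d_model, self.n_target_layers = rank, d_model, n_target_layers

    def forward(self, doc_hidden_states):                   # (1, seq_len, d_model), any seq_len
        pooled, _ = self.cross_attn(self.latents.unsqueeze(0), doc_hidden_states, doc_hidden_states)
        summary = pooled.mean(dim=1)                         # (1, d_model) fixed-size document summary
        A = self.to_A(summary).view(self.n_target_layers, self.rank, self.d_model)
        B = self.to_B(summary).view(self.n_target_layers, self.d_model, self.rank)
        return A, B                                          # one (B, A) pair per adapted layer

hypernet = DocToLoRAHypernet()
A, B = hypernet(torch.randn(1, 3000, 256))                   # a "document" longer than a short context
```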

Altogether, this gives D2L the following advantages:

  • It can process contexts much longer (more than 4x) than the base model’s native context window.

  • D2L works faster and reduces latency for follow-up queries, because internalization happens in a single forward pass instead of many gradient updates and the model no longer has to repeatedly consume the whole document.

  • It reduces KV-cache memory usage during inference.

  • It enables more dynamic use cases like frequent knowledge updates or personalized behavior.

Of course, D2L is not a perfect replacement for full long-context modeling. Its quality still depends on how well the hypernetwork learns the mapping from context to adapter, and there can be information loss compared to direct access to the full document. But as a practical mechanism for fast context internalization, D2L is a very interesting direction. Knowledge becomes something that can be mounted onto the model rather than injected through prompts.

Text-to-LoRA (T2L) follows the same general direction as Doc-to-LoRA but for tasks: adapters no longer need to be trained for each task – they can be generated directly from text.

In T2L, a natural-language task description is converted into LoRA weights that modify a frozen base model. This method also uses a hypernetwork. It reads the task description, encodes it into an embedding, and produces the low-rank LoRA matrices for different layers of the Transformer. Like in D2L, the adapter is generated in a single forward pass and attached to the base model.
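A similarly hedged sketch of the T2L idea (toy sizes, hypothetical names): an embedding of the task description from any off-the-shelf text encoder, combined with an index identifying the target layer, is mapped to that layer’s LoRA factors in one forward pass.

```python
import torch
import torch.nn as nn

class TextToLoRAHypernet(nn.Module):
    """Conceptual sketch: task embedding + target-layer index -> that layer's LoRA factors."""
    def __init__(self, emb_dim=384, d_model=256, rank=4, n_layers=8, hidden=512):
        super().__init__()
        self.layer_emb = nn.Embedding(n_layers, emb_dim)     # tells the hypernetwork which layer to target
        self.trunk = nn.Sequential(nn.Linear(emb_dim * 2, hidden), nn.GELU())
        self.to_A = nn.Linear(hidden, rank * d_model)
        self.to_B = nn.Linear(hidden, d_model * rank)
        self.rank, self.d_model = rank, d_model

    def forward(self, task_embedding, layer_idx):
        # task_embedding: (emb_dim,) produced by any text encoder from the task description.
        h = self.trunk(torch.cat([task_embedding, self.layer_emb(layer_idx)], dim=-1))
        A = self.to_A(h).view(self.rank, self.d_model)
        B = self.to_B(h).view(self.d_model, self.rank)
        return A, B                                           # generated in one pass, no gradient steps

hypernet = TextToLoRAHypernet()
A, B = hypernet(torch.randn(384), torch.tensor(3))            # e.g. "summarize legal contracts", layer 3
```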

Image Credit: Text-to-LoRA original paper

During training, T2L learns from a large distribution of tasks, and it either reconstructs existing task-specific LoRAs or is trained end-to-end with supervised fine-tuning across many tasks. Over time the hypernetwork learns a shared adaptation mechanism across tasks.

This approach unpacks a few important practical implications:

  • Adapters can be generated instantly, without running a training loop.

  • Since the input is a textual description rather than a fixed task ID, T2L can generate LoRAs even for unseen tasks.

  • T2L also acts as a compression mechanism. Many LoRAs can be compressed into one hypernetwork that regenerates them when needed, storing many task adaptations inside one model.

In experiments, adapters generated this way improved performance across reasoning, QA, coding and knowledge benchmarks, often approaching the performance of fully trained task-specific LoRAs. But at the same time the quality strongly depends on the task description.

In the long run, D2L and T2L, as two complementary update generators, enable a range of benefits – continual learning, rapid personalization, and reusable parameter modules without repeatedly retraining the whole model. This turns models into more dynamic learners.

Beyond that, several other interesting LoRAs further strengthen the adapter-based post-training stack beyond RL. Let’s look at them more closely.

Compression: Google DeepMind’s LoRA-Squeeze

What if we could make LoRA adapters even smaller and didn’t have to choose the adapter rank from the start? Specifically for this purpose, Google DeepMind created LoRA-Squeeze, which compresses adapter modules so they are smaller, easier to reuse, and easier to deploy. And you don’t need to choose the rank in advance – you can first train a LoRA adapter with a higher rank and then compress it to a smaller rank.

After the fine-tuning stage, LoRA-Squeeze reconstructs the full weight update matrix and applies Randomized Singular Value Decomposition (RSVD) to produce a lower-rank LoRA module that preserves the most important components of the update. This compression can happen after training (Post-Squeeze), gradually during training (In-Squeeze), where the rank is progressively reduced via an annealing schedule while the model continues to learn, or periodically during training (Cont-Squeeze).
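As a rough illustration of the Post-Squeeze variant (a sketch under our own assumptions, not DeepMind’s implementation), the rank reduction can be expressed with PyTorch’s randomized SVD:

```python
import torch

def squeeze_lora(A: torch.Tensor, B: torch.Tensor, new_rank: int):
    """Post-Squeeze sketch: compress a trained LoRA update BA to a lower rank.

    A: (r, d_in), B: (d_out, r), new_rank < r.
    Returns (A_small, B_small) such that B_small @ A_small ~= B @ A.
    """
    delta_w = B @ A                                    # reconstruct the full weight update (d_out, d_in)
    # torch.svd_lowrank is a randomized SVD, in the spirit of the RSVD step described above.
    U, S, V = torch.svd_lowrank(delta_w, q=new_rank)
    B_small = U * S.sqrt()                             # (d_out, new_rank)
    A_small = (V * S.sqrt()).T                         # (new_rank, d_in)
    return A_small, B_small

A, B = torch.randn(32, 1024), torch.randn(1024, 32)    # a rank-32 adapter...
A16, B16 = squeeze_lora(A, B, new_rank=16)             # ...squeezed to rank 16 for deployment
```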

Image Credit: LoRA-Squeeze original paper

After applying LoRA-Squeeze you can see some useful positive effects:

  • Lower-rank adapters often perform as well as or better than directly trained ones.

  • Less hyperparameter tuning is needed, because the training rank and deployment rank are decoupled.

  • And, of course, easier deployment due to smaller, more uniform LoRA modules.

This is really another level of lightweight parameter edits in post-training. But don’t forget: if you compress too aggressively, performance may collapse.

Kron-LoRA’s compression with structure

Another compression approach to LoRA primarily focuses on avoiding performance collapse during compression. Kron-LoRA from Cornell University uses roughly 25–30% of LoRA’s parameters – up to 4× fewer – while still matching or slightly exceeding LoRA on reasoning benchmarks.

It does this by adding Kronecker structure: the large weight update is represented as a Kronecker product of two smaller matrices.

Image Credit: Turing Post

This creates a structured pattern in the big matrix that can be stored and computed much more efficiently. Then Kron-LoRA applies a LoRA-style low-rank decomposition to one of the factors.
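Here is a conceptual sketch of that structure (toy sizes, not the paper’s exact parameterization): the weight update is the Kronecker product of a small dense factor with a second factor that is itself a LoRA-style low-rank product.

```python
import torch
import torch.nn as nn

class KronLoRAUpdate(nn.Module):
    """Conceptual sketch: delta_W = kron(F1, B @ A), where B @ A is a LoRA-style low-rank factor."""
    def __init__(self, d_out=1024, d_in=1024, kron_rows=8, kron_cols=8, rank=4):
        super().__init__()
        r2_out, r2_in = d_out // kron_rows, d_in // kron_cols      # sizes must divide evenly
        self.F1 = nn.Parameter(torch.randn(kron_rows, kron_cols) * 0.02)
        self.B = nn.Parameter(torch.zeros(r2_out, rank))           # low-rank factorization of F2
        self.A = nn.Parameter(torch.randn(rank, r2_in) * 0.02)

    def delta_w(self):
        F2 = self.B @ self.A                                       # (d_out/kron_rows, d_in/kron_cols)
        return torch.kron(self.F1, F2)                             # full (d_out, d_in) structured update

upd = KronLoRAUpdate()
print(upd.delta_w().shape, sum(p.numel() for p in upd.parameters()))  # far fewer params than d_out*d_in
```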

This way, the adapter keeps a structured, repeated pattern in the weight update while still benefiting from compression. The result is much smaller adapters with modest memory savings (and only a small 5–8% speed overhead), better scalability for multi-task and sequential fine-tuning, and stronger robustness in some continual learning settings.

Image Credit: Kron-LoRA original paper

Stacking Multiple LoRA Types: Mixture of Adapters (MoA)

The final LoRA method that we want to discuss today extends the LoRA-based fine-tuning idea in a different direction, bringing more adapters together. Developed by Zhejiang University and Tencent, Mixture of Adapters (MoA) builds a mixture of heterogeneous Parameter-Efficient Fine-Tuning (PEFT) adapters, instead of relying on many identical LoRA experts. The point is not just to add more experts, but to add different kinds of experts with complementary capacities. Here is how it works:

Image Credit: MoA original paper

  • MoA combines different adapter types inside each Transformer layer, including LoRA modules, parallel adapters, and prompt tuning.

  • A token-level router then decides how much each expert should contribute (similar to a standard Mixture-of-Experts (MoE) concept).

  • In the Soft MoA version, all experts are combined with learned sigmoid weights, while in Sparse MoA, only the experts whose contribution passes a threshold are actually activated, which reduces unnecessary computation.
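To illustrate the routing idea, here is a minimal, hypothetical sketch of one layer with two heterogeneous experts – a LoRA update and a bottleneck parallel adapter – and a token-level sigmoid router (not the paper’s implementation):

```python
import torch
import torch.nn as nn

class MoALayer(nn.Module):
    """Conceptual sketch of heterogeneous adapter routing on top of a frozen layer output."""
    def __init__(self, d_model=512, rank=4, bottleneck=64, threshold=0.1, sparse=False):
        super().__init__()
        # Two different adapter "experts": a LoRA update and a bottleneck parallel adapter.
        self.lora_A = nn.Linear(d_model, rank, bias=False)
        self.lora_B = nn.Linear(rank, d_model, bias=False)
        self.parallel = nn.Sequential(nn.Linear(d_model, bottleneck), nn.GELU(),
                                      nn.Linear(bottleneck, d_model))
        self.router = nn.Linear(d_model, 2)        # one gate per expert, per token
        self.threshold, self.sparse = threshold, sparse

    def forward(self, x):                           # x: (batch, seq, d_model) from the frozen layer
        gates = torch.sigmoid(self.router(x))       # Soft MoA: learned sigmoid weights per token
        if self.sparse:                              # Sparse MoA: drop experts below the threshold
            gates = gates * (gates > self.threshold).float()
        expert_outs = torch.stack([self.lora_B(self.lora_A(x)), self.parallel(x)], dim=-1)
        return x + (expert_outs * gates.unsqueeze(-2)).sum(dim=-1)

out = MoALayer(sparse=True)(torch.randn(2, 16, 512))
```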

So there is no “one adapter type everywhere”; instead, multiple adapter types cooperate dynamically, encouraging actual specialization and making better use of the frozen model’s pre-trained knowledge. And the benefits of this are also important:

  • Better specialization means more adaptation per trainable parameter.

  • MoA reduces expert redundancy, which is a common weakness of homogeneous MoE-LoRA methods.

  • In the sparse version, efficiency rises, as it cuts computation by activating only the experts that matter for a given token.

In experiments, Soft MoA achieved the best overall math accuracy, while Sparse MoA kept almost the same performance with even fewer active experts and lower memory.

The importance of MoA goes beyond just another LoRA variant. The future of fine-tuning will not depend only on better RL-based optimizers, but also on how we design many adaptation modules themselves. MoA suggests a post-training stack that becomes more modular, compositional, selective, and specialized, where different adapter types can be combined on demand to handle different capabilities.

“Beyond RL” is not only about adapters. Alternative optimization ideas are also part of the broader landscape of LLM fine-tuning methods. Today we look at one very promising concept, proposed by Cognizant AI Lab, that moves beyond gradient-based optimization.

What are Evolution Strategies (ES)?

Evolution Strategies (ES) come from evolutionary optimization, inspired by natural selection. The idea is not to compute gradients through backpropagation, but to estimate the optimization direction from random parameter perturbations. This estimation identifies which direction in parameter space improves the reward the most. The ES process looks like this:

  1. Start with a pretrained LLM.

  2. Create a small population of models by adding random perturbations to the parameters.

  3. Each perturbed model generates answers for a task (e.g., reasoning or math).

  4. Outputs are scored using a reward function or verifier.

  5. Model parameters are updated in the direction of perturbations that achieved higher rewards.

  6. The process repeats over many iterations.
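In code, one ES update can be sketched roughly as follows (a toy version under our own assumptions; in practice reward_fn would load the perturbed weights into the LLM, generate answers, and score them with a verifier):

```python
import torch

def es_step(params, reward_fn, pop_size=8, sigma=0.02, lr=0.01):
    """One Evolution Strategies update: perturb, score, move toward high-reward perturbations.

    params:    dict of tensors being optimized
    reward_fn: black-box function, dict of tensors -> float (no gradients needed)
    """
    noises, rewards = [], []
    for _ in range(pop_size):
        eps = {k: torch.randn_like(v) for k, v in params.items()}        # random perturbation
        candidate = {k: v + sigma * eps[k] for k, v in params.items()}   # one member of the population
        noises.append(eps)
        rewards.append(reward_fn(candidate))                             # evaluate by reward only
    r = torch.tensor(rewards)
    r = (r - r.mean()) / (r.std() + 1e-8)                                # normalize rewards
    for k in params:
        grad_est = sum(ri * eps[k] for ri, eps in zip(r, noises)) / (pop_size * sigma)
        params[k] = params[k] + lr * grad_est                            # move toward better perturbations
    return params

# Toy usage: maximize -||x||^2, so ES should drive the parameter toward zero.
theta = {"x": torch.randn(10)}
for _ in range(200):
    theta = es_step(theta, lambda p: -float((p["x"] ** 2).sum()))
```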

Since ES relies only on reward evaluations rather than gradients, it belongs to the class of zeroth-order optimization methods.

ES has several advantages compared to RL-based fine-tuning:

  • First of all, it is gradient-free. ES works even when the objective is non-differentiable, for example human feedback, evaluation metrics, or external tool outputs.

  • It works well with long-horizon or sparse rewards.

  • Less reward hacking and more robustness to noisy rewards: since ES averages many samples, it optimizes a population of models rather than a single policy, which makes exploiting the reward function harder.

  • It is easy to parallelize, because each perturbed model can be evaluated independently across GPUs.

  • Additionally, it doesn’t require backpropagation, saving GPU memory.

  • It is also more stable, because its runs show much lower variance than RL training.

And last but not least, ES can actually scale to billion-parameter models, which was previously thought to be infeasible.

If we look at the performance gains, on the Countdown benchmark, ES outperformed RL baselines: Qwen-2.5-3B improved from 10.0% to 60.5% with ES, compared to 32.5% with GRPO; and Llama-3.1-8B improved from 8.1% to 61.2%, versus ~51% with RL.

ES also showed strong results on math benchmarks (AIME 2024, MATH500, AMC, OlympiadBench). It produced large gains on puzzle tasks like ARC-AGI (0.2% → 29.5%) and Sudoku (2.5% → 69.5%).

Image Credit: Evolution Strategies original paper

However, ES still has open challenges: it requires many forward evaluations, which can be computationally expensive, and it remains dependent on the quality of the reward signal.

As for the training cycle, gradient-free ES sits at the optimization step of post-training, replacing gradient-based training in a workflow like this: Pretraining → Post-training objective defined → Optimization method (ES instead of gradient descent).

And here’s where it gets interesting: what if we mix LoRA and ES?

LoRA + ES – an alternative hybrid

Both LoRA and ES belong to post-training, but they operate at different layers of the training process:

  • LoRA and generated adapters are responsible for parameterization.

  • Evolution Strategies are used for optimization.

What’s interesting here is that LoRA introduces a very natural place for ES optimization to operate. The two reinforce each other: you can simply change the optimization target and apply ES only to the LoRA parameters, ignoring the rest of the model.

The new workflow pipeline becomes straightforward:

  1. Start with a pretrained model.

  2. Base LLM weights are frozen.

  3. Attach LoRA adapters to selected layers. Now ES can search directly in the LoRA parameter space.

  4. Apply ES to LoRA and sample many LoRA perturbations.

  5. Evaluate the model on a task.

  6. Keep the variants that perform better and update the LoRA adapters accordingly.
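Reusing the es_step sketch from the ES section above, the hybrid only changes what goes into the parameter dictionary: instead of the full model weights, it holds just the LoRA factors (the names and sizes below are hypothetical):

```python
import torch

# The search space is only the LoRA factors, so the dimensionality ES has to explore
# drops from billions of weights to a few million adapter parameters.
rank, d_model, n_layers = 8, 4096, 32
lora_params = {}
for i in range(n_layers):                        # hypothetical adapters on each targeted projection
    lora_params[f"layer{i}.A"] = torch.zeros(rank, d_model)
    lora_params[f"layer{i}.B"] = torch.zeros(d_model, rank)

def reward_fn(candidate):
    # Assumption: attach `candidate` to the frozen base model, run the evaluation set,
    # and return a verifiable score (e.g. the fraction of unit tests passed).
    ...

# for _ in range(n_iterations):
#     lora_params = es_step(lora_params, reward_fn, pop_size=32, sigma=0.01)
```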

ES performance depends heavily on the dimensionality of the parameter space, and LoRA narrows it down from about 7B–70B parameters in a full LLM to about 1M–20M parameters. Fine-tuning can then scale dramatically better. This makes the hybrid much more attractive and practical than using ES or any LoRA separately.

LoRA offers another layer where evolution can operate: systems can evolve small adapter modules that encode reasoning strategies, domain knowledge, or alignment policies. Because these modules are small, many variations can be tested quickly, and the search process becomes far more efficient. This is exactly the kind of setting where zeroth-order optimization methods such as Evolution Strategies work well: useful updates lie in a low-dimensional subspace, and LoRA explicitly constructs such a subspace.

Conclusion: What do we have today for post-training?

To sum up, here is the full stack of methods we can use at the post-training stage to give models the guidance they need to work properly.

  1. The simplest and still extremely common Supervised Fine-Tuning (SFT). It trains the model on instruction-response datasets, optimizing with standard gradient descent. It is indeed simple and stable but has limited alignment ability.

  2. The most dominant paradigm now – reinforcement learning (RL), especially RLHF and RLVR with different algorithms like PPO, GRPO, GSPO and others.

  3. Direct Preference Optimization (DPO), an alternative to RLHF that directly optimizes preferred responses over rejected ones. It gives simpler training and avoids unstable RL loops.

  4. Evolution Strategies (ES) – the Cognizant AI Lab approach, improving models by randomly perturbing parameters, evaluating performance, and updating the model toward better-performing variants, rather than using gradient-based optimization.

  5. Parameter-Efficient Fine-Tuning (PEFT) with advanced LoRA methods, which update only small, essential parts of the model instead of the entire model. The approaches we explored today demonstrate that new LoRAs reduce cost more effectively and even add new capabilities.

And finally, you can mix all of this into hybrid forms, which is what is most often done in practice – for example, SFT → DPO → agentic self-improvement with RAG added on top, or the newly proposed alternative of LoRA + ES, where LoRA defines a compact parameter subspace and Evolution Strategies optimization explores that space without requiring gradients, searching only for the best adapter parameters.

If we step back and look at all of this together, the new ideas around LoRA and optimization strategies like ES suggest a change in how language models are developed overall. Previously, the model itself was the product and training produced a finished system. Now we can see that the base model increasingly functions as a platform. Capabilities are added through modular components: adapters, retrieval systems, tool integrations, and optimization loops. Everything turns into a process of designing the ecosystem around the model.

In that ecosystem, LoRA, for example, provides the parameter subspace, Evolution Strategies act as the optimization mechanism, and the adapter generation methods we discussed today introduce new ways to convert knowledge into modules, stack them together, and shrink them to effective ranks. This grows into a system where capabilities can evolve and recombine over time without retraining the entire model – and this is the new type of fine-tuning stack for AI models.

The main lesson here is this: there are many ways to post-train models more effectively, and we can’t afford to overlook them.

Sources and further reading

  • Doc-to-LoRA: Learning to Instantly Internalize Contexts | paper

  • Text-to-LoRA: Instant Transformer Adaption | paper

  • Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA | blog post

  • LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules | paper

  • Kron-LoRA: Hybrid Kronecker-LoRA Adapters for Scalable, Sustainable Fine-tuning | paper

  • MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models | paper

  • Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning | paper
