Quick answer: What is the "Beyond RL" fine-tuning stack for LLMs?

"Beyond RL" means post-training is no longer a single reinforcement learning stage. Modern LLM tuning combines multiple methods: instruction tuning (SFT), preference alignment (RLHF/DPO/RLVR), adapter-based parameter updates (LoRA family), and alternative optimizers like Evolution Strategies (ES). The shift is from one monolithic loop to a modular stack that is cheaper, more stable, and easier to adapt per task.

Core terms in this article (fast glossary):

  • What is SFT? Supervised Fine-Tuning on prompt-response pairs to teach assistant behavior.

  • What is RLHF? Reinforcement Learning from Human Feedback using preference data and reward optimization.

  • What is RLVR? RL with verifiable rewards (tests, execution checks, exact answers).

  • What is DPO? Direct Preference Optimization that learns preferred > rejected outputs without a full RL loop.

  • What is LoRA? Low-Rank Adaptation: freeze base weights, train small low-rank updates (W' = W + BA).

  • What is Doc-to-LoRA? Generate a LoRA adapter directly from a document (context -> parameters).

  • What is Text-to-LoRA? Generate a LoRA adapter directly from a task description (text -> parameters).

  • What is LoRA-Squeeze? Post/in-training compression of LoRA rank to reduce size with minimal quality loss.

  • What is Kron-LoRA? Structured Kronecker + low-rank adapters for stronger parameter efficiency.

  • What is MoA? Mixture of Adapters: route tokens across heterogeneous adapter types.

  • What is ES? Evolution Strategies: gradient-free optimization via parameter perturbation + reward evaluation.

  • What is LoRA + ES? Use LoRA as a compact search space and ES as the optimizer for scalable post-training.

Subscribe for weekly operator-grade AI systems analysis:
https://www.turingpost.com/subscribe

But first, our travel plans: NVIDIA GTC in San Jose

Join us and other top developers, researchers, and business leaders at NVIDIA GTC to explore the next wave of AI innovation. A few of the sessions we’ll be joining:

  • Nemotron Unpacked: Build, Fine-Tune, and Deploy NVIDIA’s Open Models with Bryan Catanzaro on March 17 at 9 am PT

  • How We Scaled Kimi K2.5 [S81695] with Zhilin Yang, founder of Kimi, on March 17 at 11 am PT

  • and Open Models: Where We Are and Where We’re Headed, with a phenomenal lineup including Jensen Huang, Harrison Chase, Michael Truell, and Arthur Mensch, on March 18 at 12:30 pm PT

and don’t forget to say hi if you are at GTC in person!

Today, training is heavily RL-focused. Reinforcement learning now appears almost everywhere: alignment training that optimizes models to match human preferences, AI evaluators, or verifiable outcomes; multi-step reasoning training; agent training that teaches models to plan actions, use tools, and operate in multi-step environments (coding agents, GUI agents); and even pre-training and system-level learning.

The motivation for using RL is largely practical, inspired by the success of many reasoning models. It lets us turn objectives that have no standard training loss into something optimizable: we simply train the model to maximize a defined reward signal.

But researchers are increasingly questioning whether reinforcement learning should remain the dominant post-training stage. Why?

  • Firstly, RL pipelines are expensive and unstable: they require careful tuning of reward models and policy updates, and they often provide sparse feedback, especially for long reasoning tasks. Reward hacking is also a pain in the neck.

  • Just remember that Andrej Karpathy once said that "RL is terrible" because a single final reward spreads noisy credit across many irrelevant steps rather than identifying which parts of the reasoning actually matter.

So where exactly are the parts of a model that should be modified? How can we make post-training more efficient? The search for better training approaches is not over. RL’s limitations motivate exploring alternative post-training strategies in optimization and parameterization spaces, including parameter-efficient methods like LoRA (Low-Rank Adaptation) and gradient-free optimization approaches such as evolutionary methods.
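To make the gradient-free alternative concrete, here is a minimal sketch of a basic Evolution Strategies loop (perturb parameters, evaluate a reward, take a reward-weighted step) on a toy objective. The function name and hyperparameters are illustrative, not taken from any specific system:

```python
import numpy as np

def es_optimize(reward_fn, theta, sigma=0.1, lr=0.02, pop=50, steps=200):
    """Basic ES: sample Gaussian perturbations of the parameters,
    evaluate each perturbed copy, and move theta along the
    reward-weighted average of the perturbations (no gradients needed)."""
    for _ in range(steps):
        eps = np.random.randn(pop, theta.size)                     # perturbations
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize
        theta = theta + lr / (pop * sigma) * eps.T @ adv           # ES update
    return theta

# Toy reward: maximize -||theta - target||^2 (optimum at theta == target)
np.random.seed(0)
target = np.array([1.0, -2.0, 0.5])
theta = es_optimize(lambda t: -np.sum((t - target) ** 2), theta=np.zeros(3))
```

Note that only forward evaluations are required, which is why ES becomes attractive when the trainable parameter count is small, as with LoRA adapters.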

Today we’ll discuss what can make post-training cheaper, more stable and dynamic, and more modular than RL-based pipelines. There is a lot to learn about the new LoRAs – Doc-to-LoRA, Text-to-LoRA, LoRA-Squeeze, Kron-LoRA and MoA (Mixture of Adapters) – plus a very interesting direction, Evolution Strategies, and how they can all fit together. Let’s unpack what is beyond RL.

In today’s episode:

  • Why RL remains powerful but can be costly and brittle in post-training

  • How Doc-to-LoRA and Text-to-LoRA change adapters from "trained artifacts" to "generated modules"

  • How LoRA-Squeeze, Kron-LoRA, and MoA improve adapter efficiency and composition

  • Why Evolution Strategies (ES) are re-emerging as a serious gradient-free optimization path

  • How a LoRA + ES hybrid can make post-training cheaper, more modular, and more scalable

  • Conclusion

  • Sources and further reading

Essentials of Model Training

Before we move on to exploring new ideas for fine-tuning, we need to clarify what training pipelines we use today, and where we can place new alternatives.

When we talk about LLM training, we mean a lifecycle divided into two broad phases: pre-training and post-training. And the distinction is not about when something happens chronologically, but about what objective is being optimized and what capabilities are being added to the model.

  • Pre-training builds the model’s general intelligence substrate. The objective is to predict the next token in a sequence correctly. The result is a base model that has learned the general structure of language and knowledge from massive datasets. Most of the computational cost lies here.

  • Post-training shapes that substrate into a usable system. It teaches the model how to behave; it is more flexible and varied, comprising several different stages. Despite this, it is the cheaper part of training.

In general, a modern LLM pipeline can be seen as the following combination: pre-training → supervised fine-tuning (SFT) → alignment training → post-training optimization.

In SFT, the model is trained on prompt–response examples to learn how to follow instructions and generate structured answers. This is about how to behave as an assistant.

The alignment training stage attempts to align model outputs with human preferences or task-specific objectives. The most common approach has been RLHF (Reinforcement Learning from Human Feedback): humans rank model outputs, a reward model learns those preferences and scores the model’s outputs with a reward function, and then RL updates the model to maximize the reward using gradient-based optimization.
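The reward model in that loop is typically trained with a pairwise (Bradley–Terry) objective: push the score of the human-preferred answer above the rejected one. A minimal numeric sketch, with the function name and scores chosen for illustration:

```python
import numpy as np

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Low when the reward model already ranks the preferred answer higher."""
    return float(np.log1p(np.exp(-(r_chosen - r_rejected))))

# Reward-model scores for two candidate answers to the same prompt
loss_correct_ranking = pairwise_reward_loss(r_chosen=2.0, r_rejected=-1.0)
loss_wrong_ranking = pairwise_reward_loss(r_chosen=-1.0, r_rejected=2.0)
```

DPO takes this same pairwise objective and applies it directly to the policy's log-probabilities, which is how it skips the explicit reward model and RL loop.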

More and more systems now replace human feedback with automated evaluation signals, such as math solutions, code execution, database queries, programmatic tests and others, leading to RLVR (Reinforcement Learning with Verifiable Rewards).
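A toy sketch of what "verifiable" means in practice, using hypothetical helper names; real RLVR pipelines execute code in a sandbox and use far richer checkers than a bare exact match:

```python
def exact_match_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward for exact-answer tasks (e.g. math): 1.0 or 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def execution_reward(candidate_src: str, tests: list) -> float:
    """Execution-based reward for code tasks: define the candidate's
    `solve` function and return its pass rate over (input, expected) pairs.
    (No sandboxing here -- never exec untrusted model code like this in production.)"""
    scope = {}
    exec(candidate_src, scope)
    passed = sum(scope["solve"](x) == y for x, y in tests)
    return passed / len(tests)

r_math = exact_match_reward(" 42 ", "42")                       # 1.0
r_code = execution_reward("def solve(x):\n    return x * 2",
                          [(1, 2), (3, 6), (5, 10)])            # 1.0
```

The appeal is that these signals are cheap, objective, and immune to the drift that plagues learned reward models.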

On top of this, AI models often include additional improvements: retrieval augmentation (RAG), reasoning training to produce multi-step reasoning traces, tool use and agent behavior training, long-context optimization, safety and calibration training and more.

And today, we are witnessing Low-Rank Adaptation (LoRA) methods evolve into something new. The latest approaches like Doc-to-LoRA, Text-to-LoRA, LoRA-Squeeze and others suggest that post-training is no longer just about running RL on top of a base model. It is becoming a modular stack, where some behaviors are optimized through rewards, while others are injected instantly through generated adapters. Fine-tuning starts to look less like one monolithic training stage and more like a dynamic system for composing capabilities.

Basically, LoRA changes how model parameters are updated during post-training. Early LoRA approaches treated adapters purely as fine-tuning mechanisms. They were trained on datasets to specialize a model for a task.

In standard LoRA fine-tuning, the base model is frozen and only small low-rank matrices are trained: W' = W + BA. Here, W is the original weight matrix and BA is a low-rank update whose rank is usually between 4 and 64. So instead of modifying billions of parameters, the training procedure modifies a few million.
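A minimal NumPy sketch of that update, assuming an illustrative class name and the usual zero-init for B (so training starts from the unmodified base layer); production implementations live in libraries such as Hugging Face PEFT:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update:
    W' = W + (alpha / r) * B @ A. Only A (r x d_in) and B (d_out x r)
    would receive gradients during fine-tuning."""
    def __init__(self, W: np.ndarray, r: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                # frozen base weights
        self.A = 0.01 * np.random.randn(r, d_in)  # trainable, small random init
        self.B = np.zeros((d_out, r))             # trainable, zero init
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.W + self.scale * self.B @ self.A).T

layer = LoRALinear(W=np.eye(4), r=2)
y = layer.forward(np.ones((1, 4)))  # B is zero, so output equals the base layer's
```

For a single 4096 x 4096 weight matrix, rank 8 trains 2 x 8 x 4096 ≈ 65K parameters instead of ~16.8M, which is where the "a few million instead of billions" figure comes from across a full model.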

In practice, LoRA is often used for domain specialization, instruction tuning, reasoning tasks, and alignment experiments. So the typical workflow pipeline would be: Pre-training → Instruction fine-tuning → Alignment / capability training → Parameter-efficient fine-tuning (LoRA).

Now let’s break down how the latest LoRA methods work and what new angles of post-training they reveal.

Doc-to-LoRA and Text-to-LoRA: Changing what LoRA represents

Recent works from Sakana AI on Doc-to-LoRA and Text-to-LoRA significantly expand the idea of generated adapters. They work similarly but in different directions:

Don’t settle for shallow articles. Learn the basics and go deeper with us. Truly understanding things is deeply satisfying.

Join Premium members from top companies like Microsoft, NVIDIA, Google, Hugging Face, OpenAI, a16z, plus AI labs such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on in AI.
