
AI 101: What matters for RL? Precision! Switching BF16 → FP16

How this switch in precision influences reinforcement learning accuracy and stability, and why everyone, including Andrej Karpathy and Nathan Lambert, paid attention to this method

You might have heard this from us before, but sometimes when you can’t find a solution, it helps to look back. That’s exactly what a team from Sea AI Lab and the National University of Singapore did when facing a long-standing reinforcement learning (RL) problem.

It’s a well-known issue researchers have been trying to overcome: models often behave differently during training than they do in real use. The gap shows up in the numbers that represent the policy’s decisions – they don’t match between training and inference, which makes RL fine-tuning unstable.

Instead of designing new algorithms or adding complex adjustments, the team revisited something fundamental: numerical precision. They discovered that a simple shift back from the newer BF16 format to the older FP16 could restore stability and consistency.

Today we’ll explore what BF16 and FP16 actually are, how precision affects RL fine-tuning, and why this seemingly small change has caught the attention of developers everywhere – including Andrej Karpathy, who used it for nanochat.

In today’s episode, we will cover:

  • The origins of RL instability: why it all comes down to precision

  • What is FP16 precision and why does it really help?

  • Results of the BF16 → FP16 switch

  • Advantages of using FP16 precision format

  • Not without limitations

  • Early cases of implementation

  • Conclusion

  • Sources and further reading

The origins of RL instability: why it all comes down to precision

In many RL setups used for fine-tuning large models, training and inference run on different computation paths:

  • one for training, which computes gradients and updates model parameters

  • one for inference, when the model generates text

In theory, both engines should behave the same and produce the same mathematical results. But in practice, small numerical differences emerge because of rounding errors and hardware optimizations. This causes what’s called a training-inference mismatch – a major source of RL instability.
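To get a feel for how small – and yet how real – those differences are, here is a toy illustration (ours, not the paper’s setup). In practice both engines usually run the same 16-bit format and drift apart through different kernels and reduction orders; to make the effect visible without GPU-specific code, this sketch simply routes the same logits through BF16 and FP16 before the softmax:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(32_000)  # one vocabulary-sized logit vector as an FP32 reference

# "Training engine": the logits pass through BF16 before the log-softmax
lp_train = torch.log_softmax(logits.to(torch.bfloat16).float(), dim=-1)
# "Inference engine": the very same logits, routed through FP16 instead
lp_infer = torch.log_softmax(logits.to(torch.float16).float(), dim=-1)

# The per-token log-probabilities no longer agree exactly
diff = (lp_train - lp_infer).abs()
print(f"max |delta log-prob| = {diff.max().item():.2e}, mean = {diff.mean().item():.2e}")
```

Each individual difference is tiny, but these per-token deviations add up over long generated sequences, which is how a rounding-level effect becomes a training-level problem.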

This mismatch leads to two main problems:

  1. Biased gradient: When the model trains, it tries to achieve higher rewards. The problem is that it learns from samples generated in inference mode, where numbers are handled slightly differently – enough to throw off the gradient that guides each step of learning. Hence – biased gradient.

  2. Deployment gap: After training, the model used for text generation is not exactly the same as the one optimized during training. This happens because the parameters that perform best during training may not be optimal during deployment, so performance drops.

To address this mismatch between training and inference, researchers often turn to a technique called importance sampling. This technique reweights each sample’s contribution by the ratio between the training and inference probabilities to keep the gradient estimate unbiased. Some researchers tried to fix the mismatch through engineering rather than algorithmic changes – for example, using higher precision such as FP32 (32-bit floating point) for certain layers, or manually aligning the training and inference code paths – but these adjustments still failed to prevent training collapse. Many approaches still ended up optimizing the model for the training engine rather than the inference one – and that’s not what we need.
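To make the idea concrete, here is a minimal sketch of such a correction (our simplified version with assumed tensor names, not the authors’ implementation): the per-token loss is reweighted by the training/inference probability ratio, and extreme ratios are truncated so a single badly mismatched token cannot blow up a gradient step.

```python
import torch

def importance_weighted_pg_loss(train_logprobs: torch.Tensor,
                                infer_logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                clip: float = 10.0) -> torch.Tensor:
    """Policy-gradient surrogate with a training/inference correction (illustrative).

    train_logprobs: log-prob of each sampled token, recomputed by the training engine
    infer_logprobs: log-prob recorded by the inference engine that generated the sample
    advantages:     per-token advantage (or reward) estimates
    """
    # ratio = pi_train / pi_infer; infer_logprobs is logged data from generation, no gradient
    ratio = torch.exp(train_logprobs - infer_logprobs)
    # truncated importance sampling: cap extreme ratios, trading a little bias for stability
    ratio = torch.clamp(ratio, max=clip)
    # the gradient of this surrogate is the importance-weighted policy gradient
    return -(ratio * advantages).mean()
```

The reweighting keeps the gradient estimate honest about where the samples came from, but notice what it does not do: it does not make the two engines agree.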

So the core problem remains: training and inference still operate differently.

Researchers from Sea AI Lab and the National University of Singapore set out to find the root cause of this mismatch – and it turned out to be floating-point precision. But why?

  • During RL fine-tuning, the model’s policy is updated through numerical calculations such as floating-point operations and probability values.

  • During inference, the same policy is used, but it isn’t updated – it simply runs forward to generate outputs.

  • If the numerical precision (the level of detail in how numbers are represented) isn’t consistent, the policy can behave slightly differently across these two stages.

A floating-point number (or “float”) is a way computers store real numbers, for example, 3.14 or 0.001, using bits. Some bits represent the value’s size (the exponent), and others represent its precision (the fraction). The more bits used, the more precisely the number can be stored and calculated.
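To make the sign/exponent/fraction split concrete, here is a small, generic illustration (standard IEEE 32-bit float layout, nothing specific to the paper): pack 3.14 into 32 bits and read the three fields back out.

```python
import struct

# Raw 32 bits of 3.14 stored as an IEEE single-precision float (big-endian)
bits = struct.unpack(">I", struct.pack(">f", 3.14))[0]

sign     = bits >> 31            # 1 bit: positive or negative
exponent = (bits >> 23) & 0xFF   # 8 bits: the value's size (stored with a bias of 127)
fraction = bits & 0x7FFFFF       # 23 bits: the value's precision

print(f"sign={sign}  exponent={exponent - 127}  fraction=0x{fraction:06x}")
# -> sign=0  exponent=1  fraction=0x48f5c3   (3.14 is stored as roughly 1.57 x 2^1)
```

Cut that fraction field from 23 bits down to 7 or 10, and the same value has to be rounded much more coarsely – which is exactly the difference between the two 16-bit formats discussed next.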

Most RL fine-tuning methods now use a format called BF16 (bfloat16, or brain floating point). It’s a 16-bit floating-point format that keeps the same wide range of values as 32-bit floats but uses fewer bits for precision – 16 bits total:

  • 1 bit for the sign (positive or negative),

  • 8 bits for the exponent (range of values),

  • 7 bits for the fraction, or mantissa (precision) – watch the mantissa bits!

Image Credit: ZipNN: Lossless Compression for AI Models

But again, the problem is that BF16 introduces rounding errors – small deviations that accumulate and push the training and inference policies out of alignment.

To fix it, researchers from Sea AI Lab and the National University of Singapore proposed something that turned out to be surprisingly simple – just switch from BF16 back to the earlier FP16 format during RL fine-tuning. Mindblowing 🙂
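Why would FP16 round more finely? FP16 spends its 16 bits differently: 5 exponent bits and 10 mantissa bits, versus BF16’s 8 and 7, so within its (narrower) range it represents values with noticeably less rounding. A quick illustrative check of that gap (ours, assuming PyTorch, not code from the paper):

```python
import torch

x = torch.tensor(0.3141592653589793, dtype=torch.float64)  # a probability-like value
for dtype in (torch.bfloat16, torch.float16):
    eps = torch.finfo(dtype).eps           # spacing between 1.0 and the next representable value
    stored = x.to(dtype).double().item()   # what this precision actually stores
    print(f"{str(dtype):>14}: eps = {eps:.1e}, 0.31415926... stored as {stored:.10f}")
# bfloat16 eps is ~7.8e-03, float16 eps is ~9.8e-04 -- FP16 rounds roughly 8x more finely
```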

Now, let’s look closer at why such a small change could make such a big difference – and how it fixed what years of patches couldn’t.

What is FP16 precision and why does it really help?
