someday soon something smarter than the smartest person you know will be running on a device in your pocket, helping you with whatever you want.
this is a very remarkable thing.
Feels like a plot twist.
Feels like a comeback.
Feels like the beginning of something big.
That’s how Hugging Face CEO Clem Delangue described the GPT-OSS launch, and in a week that also saw Anthropic drop Claude 4.1 Opus and DeepMind unveil the jaw-dropping Genie 3 world simulator, OpenAI still managed to steal the spotlight. They just dropped GPT-OSS, a family of powerful, permissively licensed open-weight models that immediately shot to the #1 trending spot on Hugging Face.
We all know that for the past year, the most exciting action in open-source AI has come from Chinese labs like Qwen, DeepSeek, Moonshot AI and others (see our recent breakdown). That’s why OpenAI’s re-entering the arena is so important. After six months of quiet collaboration with HF, they’ve released two highly capable models that are already shaking up the ecosystem. That’s why we’re also changing our editorial schedule and, instead of time-test compute, covering the one and only GPT-OSS.
The most shocking thing, of course, is that OpenAI finally noticed the “open” part in their name. The most admirable thing is that they’re setting an example for other closed-source American companies. But in the community, there’s also a whirlwind of conflicting takes, that beneath the surface of the "gift" lies a complex and calculated strategy. Is this a genuine return to the open-source ethos? Is it a strategic gambit to reclaim the narrative? Or a trick to lock developers into the OpenAI paradigm? Well, all of it and more.
In today’s episode, we will cover:
The Plot Twist: Why Is OpenAI Releasing Open Models Now?
The Models: What’s Under the Hood of GPT-OSS
Harmony: The New Prompting Standard?
Hands-On: Does It Actually Work (and How)?
Where to use it + Installation
The Performance Paradox: A Spiky, Brittle Genius?
Reclaiming the Open-Source Crown?
Safety, Red Teaming, and Worst-Case Scenarios
GPT-OSS: Official Results (more evals needed)
Conclusion: Your Guide to the GPT-OSS Family
Sources and further reading
Why Did OpenAI Release GPT-OSS Open-Weight Models
The story, as told by Clem Delangue, began six months ago when Sam Altman at AI Action Summit in Paris declared they were serious about open source. It was a statement many found hard to believe. OpenAI’s journey from a non-profit research lab to the titan of proprietary AI is the stuff of Silicon Valley legend. Their last major open-weight language model release was GPT-2, an eternity ago in AI time.
So, why the change of heart?
There are a few compelling reasons. First, it was an Action Summit after all – and just in January, DeepSeek R1 gave everyone a good kick in the butt for not acting toward openness. By showing its full chain-of-thought and using a permissive license, DeepSeek set a new standard for transparency and trust in reasoning models. And then, like mushrooms after rain, other Chinese models followed suit. Chinese! What a shame to the states. Action was really needed.
That brings us to the geopolitical angle. With growing calls for strong American open-source AI foundations to counterbalance the momentum from China, who better to deliver than the startup that has led the field? It’s a move to ensure the US remains at the forefront of what has become a global open-source race. Almost weird they didn’t call the model GPT-USA 🇺🇸.
They couldn’t do it alone, though. OpenAI doesn’t exactly enjoy the warmest reputation among developers these days. Hugging Face, on the other hand, is everyone’s darling. It was the obvious partner to turn to. In the world of AI, vibes matter – and Hugging Face brings the kind that makes open source feel like a movement, not a memo. Plus, their strong roots in Europe and support from figures like Yann LeCun made the move even more noticeable.
Third, as Nathan Lambert suggests, this is a strategic “scorched earth policy.” By releasing a powerful open model that undercuts the performance of their own o4-mini API and other competitors, OpenAI could be clearing the lower end of the market ahead of a future GPT-5 release – hoping to capture the premium tier.
What Is GPT-OSS? Architecture & Key Specs
GPT-OSS are large-scale language models based on the same ideas behind GPT-2 and GPT-3 but improved with Mixture-of-Experts (MoE) transformer architecture to activate only a part of the model at any time, saving memory and making training more efficient. Both come with a permissive Apache 2.0 license, a 128k context window, and full access to the chain-of-thought.
A Selective Step Toward Openness
From what we understand: OpenAI did not release the base models. What they’ve shared are the final, instruction-tuned, safety-aligned versions.
This distinction matters. By keeping the base models and training data private, OpenAI is safeguarding its core intellectual property – the foundational elements that underpin its competitive edge. In effect, they’ve provided the car, but not the blueprints to the factory. This approach allows them to foster community engagement and adoption while maintaining control over the essential ingredients of their technology. It’s a strategic decision that signals a measured, rather than absolute, commitment to openness. Which is understandable when you need to make profit.
Open AI presented two versions of GPT-OSS:
GPT-OSS-120B: A large model with 36 transformer layers, ~117 billion total parameters, where only ~5.1B are used per token during each step. It has 128 experts per MoE block.
GPT-OSS-20B: A smaller version with 24 layers, ~21 billion total parameters, and 3.6B parameters active per token. It includes 32 experts per block.
Thanks to quantization that shrank the size of the MoE layers – reducing their precision to about 4.25 bits per parameter – the gpt‑oss‑120B can run efficiently on a single 80 GB GPU. The smaller gpt‑oss‑20B operates on just 16 GB of memory, making it accessible even on edge devices or laptops with 32 GB of RAM (though performance is slower without GPU acceleration). It is a bit jarring to see the hardware gap of 80 GB versus 16 GB – especially since many top-tier laptop models assume ~64 GB for powerful local inference. That said, these models make local deployment and rapid iteration feasible without requiring costly infrastructure.
Let’s look more precisely at the main components of GPT-OSS’s MoE architecture:
The architecture is a fascinating mix of modern trends and throwbacks. As Sebastian Raschka put it: “The first surprising fun fact is they used bias units in the attention mechanism like ye goode old GPT-2. Super interesting, I haven't seen any other architecture doing that since then!”
It processes inputs using a hybrid attention mechanism, alternating dense and locally-banded sparse patterns, paired with Grouped Query Attention (GQA) with 8 key-value heads shared across query heads (each transformer layer has 64 attention heads), and Rotary Positional Embeddings (RoPE), enabling fast inference and long-context reasoning up to 131,072 tokens.
Root Mean Square Normalization (RMSNorm) is applied before each attention and MoE block. It normalizes input vectors by dividing them by their root mean square (RMS) value instead of using the mean, and so stabilizes training by keeping the scale of activations consistent.
For each token, a router picks the top 4 experts, whose outputs are then combined using softmax-weighted routing.
GPT-OSS uses a customized version of SwiGLU, a non-linear activation function, with extra features, such as:
Clamping which limits the range of output values to avoid extreme activations, and
Residuals connection that adds the input back to the output of the activation for better gradient flow.
These changes stabilize training dynamics.
GPT-OSS uses Pre-LN (layer normalization before the block), similar to GPT-2 and other modern variants of model.
It also adds bias terms in the softmax denominator to let the model "ignore" irrelevant tokens when needed.
Another interesting feature is that in GPT-OSS OpenAI a custom tokenizer, as many advanced models now do. This custom tokenizer is called o200k_harmony, which is:
Based on Byte Pair Encoding (BPE) that breaks words into smaller pieces based on how often parts of words appear together
Extended from OpenAI’s existing tokenizer used in GPT-4o, o4-mini, etc. (now we know more about these models too!)
Designed to work well with chat-style formatting and contains 201,088 tokens.
Fully open-sourced via the TikToken library.
As GPT-OSS models are designed to offer high reasoning performance, strong tool-use capabilities, and efficient deployment, so pre-training and post-training stages are focused on gaining exactly these capabilities.
Pre-training
The GPT-OSS models were pretrained on a massive text-only dataset with trillions of tokens focused on STEM, coding, and general knowledge (with a knowledge cutoff of June 2024) while applying filters to remove harmful or sensitive content, especially around biosecurity.
One of the most significant technical details is the native MXFP4 quantization. The MoE layers were trained using this low-precision format, allowing the massive 120B model to run on a single 80GB GPU and the 20B model to fit within 16GB of memory. This is a game-changer for accessibility.
Post-training
After pretraining, the GPT-OSS models underwent post-training using Chain-of-Thought (CoT) reinforcement learning, similar to OpenAI’s o3 models, to improve step-by-step reasoning and agentic behavior. They were taught to use tools like:
A browser for web search
A Python execution environment
Developer-defined functions
Reasoning of GPT-OSS is configurable at three levels: low, medium, and high via system prompt directives like “Reasoning: high”, which control the average CoT length and accuracy.
Something to remember: inference for new frontier open models isn’t easy, especially with a new format like harmony and the volume of interest that gpt-oss is getting out of the gate.
Early spikes can temporarily affect quality, accuracy, and overall "vibes," particularly just 24 hours post-release when providers are racing against the clock with barely any sleep!
What Is Harmony? GPT-OSS's New Prompting Format
One of the gnarliest problems in the LLM ecosystem is prompt templating. With this release, OpenAI is proposing a solution: Harmony. It’s a new, open-source (Apache 2.0) response format that is clearly designed to mimic OpenAI's proprietary Responses API. And here lies the crux of the strategy: by encouraging the entire open-source ecosystem to adopt a format that mirrors their own paid platform, OpenAI is building an on-ramp. Developers who build tooling, agents, and workflows around the open Harmony format will find it incredibly easy to switch or upgrade to OpenAI's more powerful, multimodal proprietary APIs later.
Splitting up system and developer as a message type better reflects how we are using system prompting, and the formalization of the output channels means that applications can start to standardize outputs. Commentary will show tool calls, but the analysis channel is its chain of thought (CoT), and in the spec its marked:
*Important:* Messages in the analysis channel do not adhere to the same safety standards as final messages do. Avoid showing these to end-users.`
It’s a masterclass in platform strategy. It is an open-source contribution, but it’s also a potential lock-in mechanism that makes OpenAI’s ecosystem the default for a new generation of agentic applications. Harmony introduces several concepts that are new to the open-weight world:
Expanded Roles: Beyond user and assistant, Harmony defines roles for system, developer, and tool.
Output Channels: It provides three distinct output streams: final (for the user), analysis (the internal chain-of-thought), and commentary (for tool interactions).
Robust Special Tokens: Harmony uses a new tokenizer (o200k_harmony) with dedicated token IDs for instructions like <|call|> and <|channel|>, making the format far more robust.
This is a serious attempt to standardize how we communicate with complex, agentic models, and OpenAI is making it easy for the ecosystem to adopt by open-sourcing the renderer in both Python and Rust. (And to switch to their paid API.)
That’s what we have from the technical side of OpenAI’s freshest model. And here is how it performed when put to the test.
Hands-On: Does It Actually Work (and How)?
Benchmarks are great, but the real test is getting your hands dirty. Simon Willison – as always checked the model on his personal “Pelican on a bicycle” eval. The results are here. Will Schenk (from Thefocus.ai) shared his results with us:
“gpt-oss feels like a foundation model from a year ago. Its polished and one of the strongest text-only models you can run locally, now available for remixing. I expected that small models would become more capable, but its still surprising that my 4 year old laptop can run the smaller of the two released models. It's text only, which rules out some of the use cases that gemma 3 in particular is good at, but the over all result feels very polished.
Tool use in general is good but can get stuck in a loop when trying to reason. Agentic workflows aren't there yet but it feels very close. It can write code surprisingly well for running locally, but when it started called tools to actually write the files it fell down. The 20b sized one is not to the level when it can property back something like goose or RooCode. Goose was able to design and write code to the console, but when it started called tools to actually write the code in the filesystem it fell down. RooCode was completely confused by the assignment even with the context window maxed out.
Even without the actual source, these models give you the freedom to run, freedom to study, and the freedom to modify.”
How to Run GPT-OSS Locally: Ollama & LM Studio
You can try it on Hugging Face: https://huggingface.co/openai/gpt-oss-120b
Ollama
Download ollama. Start it up. In the terminal pull down the model, and then lets go
And then run
ollama pull gpt-oss:20band then run
ollama run gpt-oss:20b "why is the sky dark at night"Or use the front end.
lmstudio
Download lmstudio and then it will prompt you

You can use the command line or the interface.
GPT-OSS Context Size and Reasoning Effort
The larger the context size, the more that it thinks. You can as to to think on three levels using the prompt, but the context is also important.
Ollama has a default context size of:
$ ollama run gpt-oss:20b
>>> /show info
Model
architecture gptoss
parameters 20.9B
context length 131072
embedding length 2880
quantization MXFP4 lmstudio defaults to so you need to change the context length and then reload the model.
GPT-OSS Benchmarks: How It Compares to DeepSeek & Qwen3
The reaction to the model's performance has been complex. While its reasoning on specific tasks is top-tier, some users have reported a "hallucination fiesta" and poor scores on out-of-distribution benchmarks.
This points to a "spiky" or "brittle" performance profile. The model appears to be a genius at the tasks it was trained for, but struggles when it strays from them. This has led to widespread speculation that GPT-OSS was heavily trained on synthetic data. Using synthetic data is a smart way to scale training while avoiding copyright issues and reducing harmful content. However, the downside is that it can lead to over-optimization, creating models that ace benchmarks but lack the general robustness of models trained on messier, real-world data. For anyone considering GPT-OSS for production, this is a critical factor to evaluate.
Here is a completely wrong answer it gave us:

For comparison, here is chatGPT:

To be fair: you can connect web search via Ollama to empower gpt-oss pretty easily.
Is GPT-OSS the Best Open-Weight Model?
So, is GPT-OSS the new king of open source? Not quite. But it’s important – a very important – step for OpenAI and an inspiration for others.
Independent evaluations confirm it’s the most intelligent American open-weights model, but it doesn't definitively dethrone top Chinese models like DeepSeek R1 or Qwen3-235B on raw intelligence. Its strength lies in its incredible efficiency – achieving near-parity with far fewer active parameters.
However, there's a catch for the research community: OpenAI did not release the base models. They only released the final, instruction-tuned, and safety-aligned models. This makes them much harder for researchers to study, modify, and build upon from first principles. For many in the "true open source" community, dense base models are the most valuable artifacts, and their absence here is a significant limitation.
Safety, Red Teaming, and Worst-Case Scenarios
Another limitation comes from safety measures. OpenAI has been extremely deliberate about the safety narrative for this release.
The most notable step? They directly assessed the risk of malicious fine-tuning. Simulating an attacker, they adversarially fine-tuned gpt-oss-120b on specialized biology and cybersecurity data to create a non-refusing version. The conclusion, reviewed by external experts, was that even this souped-up version did not reach "High" capability levels for catastrophic risk according to their Preparedness Framework.
To further this, OpenAI has launched a $500,000 Red Teaming Challenge, inviting the community to help identify novel safety issues.
However, this proactive safety posture comes with trade-offs. By making it harder to fine-tune away the built-in guardrails, OpenAI is drawing a clear line around how much control developers truly have. For some in the research and open-source communities, this feels less like responsible alignment and more like a restriction. After all, if a model can’t be deeply modified, how useful is it for open-ended research?
GPT-OSS: Official Results more evals needed
As for performance, according to OpenAI, gpt-oss-120B matches or exceeds leading proprietary OpenAI models like o4-mini across high-stakes reasoning and tool use tasks:
In math and science it scored 98.7% on AIME 2025 and 83.3% on GPQA Diamond, outperforming o4-mini and o3-mini on both.
In coding, it matched o4-mini on Codeforces, and reached 67.8% on SWE-Bench.

Image Credit: Introducing gpt-oss blog
It also showed strong developer tool use with 70.4% on τ-Bench function calling.
In medically grounded scenarios, it completely outperformed o4-mini, GPT-4o and o3.
Despite being ~6× smaller, GPT-OSS-20B remains competitive, surpassing o3-mini in most key domains.
Conclusion: Your Guide to the GPT-OSS Family
We've explored OpenAI's dramatic return to open weights source, from the architecture to the community reaction. Here’s a quick summary to help you decide what to do next.
Model | Strengths | Key Tech | Best For |
gpt-oss-120b | Near o4-mini reasoning, strong tool use, highly efficient. | MoE (128 experts), MXFP4 quantization, Harmony format. | Production workloads, complex agentic tasks, and research on a single 80GB GPU. |
gpt-oss-20b | o3-mini level performance, excellent reasoning for its size. | MoE (32 experts), MXFP4 quantization, Harmony format. | Local/on-device inference, rapid prototyping, fine-tuning for specialized tasks on consumer hardware. |
So, where does this leave us?
If you need a production-grade open model with top-tier reasoning and tool use, and have access to a high-end GPU, gpt-oss-120b is likely the new best-in-class American model for efficiency and performance.
If you're a developer who wants to run a powerful model locally for coding and experimentation, gpt-oss-20b is quite a gift. It offers incredible performance in a tiny package. And it’s really fast, faster than DeepSeek.
OpenAI’s re-entry has reignited the open-source landscape. It’s a phenomenal step forward, but the lack of a base model shows they are still keeping their most valuable assets close to the chest. This is a cool month with a very cool release, and we can't wait to see what everyone builds and what’s next for the US open-source movement.
Sources and further reading
Official Releases
Community:
“When Sam Altman told me…” by Clem Delangue
gpt-oss: OpenAI validates the open ecosystem (finally) by Nathan Lambert
Sebastian Raschka’s tweet
OpenAI’s new open weight (Apache 2) models are really good by Simon Willison
smol.ai AI News for 8/5/2025
“hallucination fiesta” thread on Twitter by Lisan al Gaib
Resources from Turing Post











