Open weight LLMs exhibit inconsistent performance across providers

<p>Artificial Analysis published <a href="https://artificialanalysis.ai/models/gpt-oss-120b/providers#aime25x32-performance-gpt-oss-120b">a new benchmark</a> the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers.</p>

<p>The results showed some surprising differences. Here's the one with the greatest variance: the 2025 AIME (American Invitational Mathematics Examination), run 32 times against each provider, using gpt-oss-120b with a reasoning effort of "high":</p>

<p><img src="https://static.simonwillison.net/static/2025/aim25x32-gpt-oss-120b.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges (Min, 25th, Median, 75th, Max) for each framework. Title: &quot;AIME25x32 Performance: gpt-oss-120B&quot; with subtitle &quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&quot;. Legend indicates &quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Together.ai (93.3%), Parasail (90.0%), Groq (86.7%), Amazon (83.3%), Azure (80.0%), CompactifAI (36.7%). Watermark shows &quot;Artificial Analysis&quot; logo." style="max-width: 100%;" /></p>

<p>These are some varied results!</p>

<ul>

<li>93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.1.0</li>

<li>90.0%: Parasail</li>

<li>86.7%: Groq</li>

<li>83.3%: Amazon</li>

<li>80.0%: Azure</li>

<li>36.7%: CompactifAI</li>

</ul>
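<p>Those percentages are easier to compare as question counts. Here's a minimal sketch of the arithmetic, assuming the AIME 2025 set has 30 questions (AIME I plus AIME II, 15 each). That total is my assumption, not something stated on the chart:</p>

```python
# Assumption: the AIME 2025 set has 30 questions (AIME I + AIME II).
QUESTIONS = 30

# Headline scores from the list above, one representative per score tier.
scores = {
    "Cerebras": 93.3,
    "Parasail": 90.0,
    "Groq": 86.7,
    "Amazon": 83.3,
    "Azure": 80.0,
    "CompactifAI": 36.7,
}


def questions_correct(percent: float, total: int = QUESTIONS) -> int:
    """Convert a rounded percentage back to a whole number of correct answers."""
    return round(percent / 100 * total)


correct = {name: questions_correct(pct) for name, pct in scores.items()}
for name, n in correct.items():
    print(f"{name}: {n}/{QUESTIONS}")
```

<p>Under that assumption 93.3% works out to 28 of 30 and 90.0% to 27 of 30, so each step between the top tiers is a single question.</p>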

<p>It looks like most of the providers that scored 93.3% were running models using the latest <a href="https://github.com/vllm-project/vllm">vLLM</a> (with the exception of Cerebras, who I believe have their own custom serving stack).</p>

<p>I hadn't heard of CompactifAI before - I found <a href="https://www.hpcwire.com/off-the-wire/multiverse-computing-closes-e189m-series-b-to-scale-compactifai-deployment/">this June 12th 2025 press release</a> which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!</p>

<p>Microsoft Azure's Lucas Pickup <a href="https://x.com/lupickup/status/1955620918086226223">confirmed</a> that Azure's 80% score was caused by running an older vLLM, now fixed:</p>

<blockquote>

<p>This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.</p>

</blockquote>

<p>No news yet on what went wrong with the AWS Bedrock version.</p>

<h4 id="the-challenge-for-customers-of-open-weight-models">The challenge for customers of open weight models</h4>

<p>As a customer of open weight model providers, this really isn't something I wanted to have to think about!</p>

<p>It's not really a surprise though. When running models myself I inevitably have to make choices - about which serving framework to use (I'm usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and the quantization size to use.</p>

<p>I know that quantization has an impact, but it's difficult for me to quantify that effect.</p>

<p>With hosted models, it looks like even knowing which quantization a provider is using isn't necessarily enough to predict that model's performance.</p>

<p>I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform - if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.</p>

<p>There's a lot that can go wrong. Tool calling is particularly vulnerable to these differences - models have been trained on specific tool-calling conventions, and if a provider doesn't get these exactly right the results can be unpredictable but difficult to diagnose.</p>
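<p>One cheap way to catch the more obvious tool-calling failures is to validate each call a provider returns against the tool definition it was given. This is a rough sketch only: the <code>get_weather</code> tool and the <code>check_tool_call</code> helper are hypothetical names I made up for illustration, and the definition just follows the JSON-schema style used by OpenAI-compatible chat APIs.</p>

```python
import json

# Hypothetical tool definition, in the JSON-schema style used by
# OpenAI-compatible chat APIs.
TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def check_tool_call(name: str, raw_arguments: str, tool: dict = TOOL) -> list[str]:
    """Return a list of problems with a tool call; an empty list means none found."""
    problems = []
    if name != tool["name"]:
        problems.append(f"unknown tool name: {name!r}")
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        # Broken JSON usually points at a template or parsing bug upstream.
        return problems + [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return problems + ["arguments are not a JSON object"]
    for key in tool["parameters"]["required"]:
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key in args:
        if key not in tool["parameters"]["properties"]:
            problems.append(f"unexpected argument: {key}")
    return problems


print(check_tool_call("get_weather", '{"city": "Paris"}'))
print(check_tool_call("get_weather", '{"town": "Paris"}'))
```

<p>A check like this only catches structural problems. It says nothing about whether the model picked the right tool or sensible arguments, which is the part that's difficult to diagnose.</p>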

<p>What would help <em>enormously</em> here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model's implementation.</p>

<p>Models aren't deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.</p>
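<p>Because a single run can't be trusted, a conformance check has to repeat the suite and compare distributions rather than exact outputs. Here's a minimal sketch of that idea, assuming a hypothetical <code>ask(prompt)</code> function standing in for a hosted endpoint. The test cases and the fake providers are invented for illustration and are not part of any real suite:</p>

```python
import random
import statistics
from typing import Callable

# Hypothetical test cases: (prompt, expected answer). A real suite would be
# published alongside the model.
CASES = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]


def run_suite(ask: Callable[[str], str], runs: int = 32) -> list[float]:
    """Run every case `runs` times and return one accuracy score per run."""
    scores = []
    for _ in range(runs):
        passed = sum(ask(prompt) == expected for prompt, expected in CASES)
        scores.append(passed / len(CASES))
    return scores


def summarize(scores: list[float]) -> dict[str, float]:
    """Five-number summary: min, 25th percentile, median, 75th percentile, max."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


def make_fake_provider(error_rate: float, seed: int) -> Callable[[str], str]:
    """Stand-in for a hosted endpoint: answers correctly except at error_rate."""
    answers = dict(CASES)
    rng = random.Random(seed)

    def ask(prompt: str) -> str:
        return "?" if rng.random() < error_rate else answers[prompt]

    return ask


good = summarize(run_suite(make_fake_provider(0.0, seed=1)))
flaky = summarize(run_suite(make_fake_provider(0.5, seed=1)))
print("good: ", good)
print("flaky:", flaky)
```

<p>The summary has the same shape as the box plots in the chart above. Comparing a provider's summary against a known-good reference implementation is one way to spot a misconfigured deployment without needing identical outputs.</p>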

<p id="update"><strong>Update</strong>: <a href="https://x.com/DKundel/status/1956395988836368587">Via OpenAI's Dominik Kundel</a> I learned that OpenAI now include a <a href="https://github.com/openai/gpt-oss/tree/main/compatibility-test">compatibility test</a> in the gpt-oss GitHub repository to help providers verify that they have implemented things like tool calling templates correctly, described in more detail in their <a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations">Verifying gpt-oss implementations</a> cookbook.</p>



<p>Here's <a href="https://til.simonwillison.net/llms/gpt-oss-evals">my TIL</a> on running part of that eval suite.</p>



<h4 id="update-aug-20">Update: August 20th 2025</h4>



<p>Since I first wrote this article Artificial Analysis have updated the benchmark results to reflect fixes that vendors have made since their initial run. Here's what it looks like today:</p>



<p><img src="https://static.simonwillison.net/static/2025/gpt-oss-eval-updated.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges for each framework. Title: &quot;AIME25x32 Performance: gpt-oss-120B&quot; with subtitle &quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&quot;. Legend indicates &quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Azure (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Groq (93.3%), Together.ai (93.3%), Parasail (90.0%), Google Vertex (83.3%), Amazon (80.0%). Watermark shows &quot;Artificial Analysis&quot; logo." style="max-width: 100%" /></p>

<p>Groq and Azure have both improved their scores to 93.3%. Google Vertex is new to the chart at 83.3%.</p>