
AI inference provider performance inconsistencies

Soon after the OpenAI gpt-oss release, the community noticed stark performance differences across inference providers, particularly with AWS Bedrock, which sometimes produced inconsistent outputs not seen with other providers. Artificial Analysis quantified the underperformance, reporting gpt-oss-120b via AWS Bedrock at -6.3 pp on the GPQA Diamond @16 benchmark, at -10 pp on AIME 2025 @32 (together with Google Vertex AI), and at -4.9 pp on IFBench @16 versus the leading providers, with the #1 spot claimed by a different provider on each benchmark. (Higher accuracies have since been reported.)
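
For readers unfamiliar with the notation: the @16/@32 suffixes denote, as I read them, accuracy averaged over that many repeated runs of the benchmark, and the "pp" gaps are differences between such averages in percentage points. A minimal sketch of the arithmetic, with `run_once` as a hypothetical placeholder for a real evaluation pass:

```python
# Toy illustration of "@k" scoring: run the benchmark k times and
# average the per-run accuracies. run_once() is a hypothetical
# stand-in for one full benchmark pass; it is not a real harness.
import random
from statistics import mean

def run_once() -> float:
    """Hypothetical single benchmark pass returning accuracy in [0, 1]."""
    return random.uniform(0.70, 0.80)  # placeholder for a real eval run

def accuracy_at_k(k: int) -> float:
    """Mean accuracy over k repeated runs, e.g. GPQA Diamond @16 uses k=16."""
    return mean(run_once() for _ in range(k))

def delta_pp(provider: float, leader: float) -> float:
    """Gap to the leading provider, in percentage points (pp)."""
    return (provider - leader) * 100

print(f"{delta_pp(accuracy_at_k(16), accuracy_at_k(16)):+.1f} pp")
```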

My report on gpt-oss-20b security evaluations finds differences of sometimes 10 pp or more between an Nvidia RTX 5090 + Hugging Face Transformers stack and an Nvidia H100 + vLLM stack. Early pilot testing via OpenRouter showed inconsistencies as well, along the lines of the sketch below.
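
Such a pilot check can be scripted against OpenRouter's chat-completions endpoint, using its provider routing options to pin each request to a specific upstream provider and then comparing the answers. A minimal sketch, not my actual test harness; the provider slugs in `PROVIDERS` and the prompt are illustrative placeholders:

```python
# Send the same prompt to the same model via OpenRouter, pinned to
# different upstream providers, and print the answers side by side.
import os
import requests

PROVIDERS = ["ProviderA", "ProviderB"]  # hypothetical provider slugs

def ask(prompt: str, provider: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-oss-20b",
            "messages": [{"role": "user", "content": prompt}],
            # Pin to one provider; disable silent rerouting.
            "provider": {"order": [provider], "allow_fallbacks": False},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    prompt = "What is 17 * 23? Answer with the number only."
    for p in PROVIDERS:
        print(p, "->", ask(prompt, p))
```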

Such differences are now receiving broader attention, with numbers reported for DeepSeek V3.1 Terminus Non-reasoning (up to -19.08 pp “Similarity compared to the official Implementation”) and Moonshot Kimi K2 (up to -38.45 pp similarity). Anthropic had previously reported intermittently degraded responses stemming in part from the share of requests served from Google TPU hardware rather than from AWS Trainium or Nvidia GPUs.
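
The cited reports express degradation as a “similarity” score against the official implementation without detailing the metric. One simple variant of the idea is the rate at which a provider's final answers agree with a reference run on identical prompts, expressed as a percentage-point gap. A minimal sketch of that notion, not the methodology behind the reported numbers:

```python
# Agreement-rate "similarity": fraction of prompts where the provider's
# final answer matches the reference implementation's answer, reported
# as a gap in percentage points below full agreement.
def agreement_pp_gap(provider_answers: list[str], reference_answers: list[str]) -> float:
    """Return e.g. -19.08 for a provider 19.08 pp below full agreement."""
    assert len(provider_answers) == len(reference_answers)
    matches = sum(
        p.strip() == r.strip()
        for p, r in zip(provider_answers, reference_answers)
    )
    return 100 * matches / len(reference_answers) - 100

print(agreement_pp_gap(["391", "42"], ["391", "43"]))  # -> -50.0
```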