AMD MI355X beats Blackwell in 9 of 10 benchmarks

The gloves are off in the AI hardware race. AMD’s MI355X is no longer playing catch-up. It’s punching through. In nine out of ten enterprise AI benchmarks, the MI355X outperforms Nvidia’s Blackwell B200. That’s not marketing fluff. That’s lab-tested throughput, latency, and memory bandwidth. The hardware moat Nvidia built over the last decade is showing cracks. CUDA is the last pillar holding it up.

Signal65’s June 2025 benchmark report lays it out. The MI355X delivered 1.35x geometric mean performance across 11 LLM configurations. On DeepSeek-R1, it pushed 1.5x higher throughput. On Llama 3.1 405B, it ranged from near parity to more than 2x Nvidia’s published results. MLPerf LoRA fine-tuning on Llama 2 70B ran 9.6% faster on a single MI355X 8-GPU system than on a four-node MI300X cluster. That’s a generational leap.
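A geometric mean, the metric Signal65 uses, is the n-th root of the product of the per-benchmark speedups, so one outlier can’t carry the headline number. A minimal sketch (the per-configuration speedups below are hypothetical, for illustration only, not Signal65’s actual per-test figures):

```python
from math import prod

def geometric_mean(speedups):
    """Geometric mean: the n-th root of the product of per-benchmark speedups."""
    return prod(speedups) ** (1 / len(speedups))

# Hypothetical MI355X-vs-B200 speedups across 11 LLM configurations.
speedups = [1.5, 1.1, 1.4, 1.2, 1.6, 1.3, 1.05, 1.45, 1.35, 1.5, 1.25]
print(f"{geometric_mean(speedups):.2f}x")
```

Unlike an arithmetic mean, a single 10x result on one test barely moves this number, which is why it’s the standard way to aggregate benchmark suites.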

The MI355X packs 288 GB of HBM3e memory per card. That’s 96 GB more than Blackwell’s 192 GB. Bandwidth hits 8 TB/s, compared to 5 TB/s on the B200. That extra memory isn’t just for show. It allows full Llama 3 70B checkpoints in FP8, 128k-token KV caches without spilling to host RAM, and full Llama 3.1 405B FP4 models on every GPU. No tensor-parallel gymnastics. Just raw throughput.
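The arithmetic behind those claims is easy to check. A back-of-the-envelope sketch, assuming Llama 3 70B’s published architecture (80 layers, 8 grouped-query KV heads, head dimension 128) — the helper functions here are illustrative, not any vendor tool:

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Model weight footprint in GB at a given numeric precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int) -> float:
    """KV cache size in GB: keys + values, across all layers."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# Llama 3 70B in FP8: ~70 GB of weights -- fits on one card with room to spare.
print(f"70B @ FP8:  {weights_gb(70, 8):.1f} GB")
# Llama 3.1 405B in FP4: ~202.5 GB -- fits in 288 GB, not in 192 GB.
print(f"405B @ FP4: {weights_gb(405, 4):.1f} GB")
# 128k-token FP8 KV cache for Llama 3 70B: ~21.5 GB on top of the weights.
print(f"128k KV:    {kv_cache_gb(80, 8, 128, 128 * 1024, 1):.1f} GB")
```

The 405B case is the interesting one: at 4 bits per weight the checkpoint lands just over 200 GB, past Blackwell’s 192 GB but comfortably inside 288 GB, which is what lets a single MI355X skip tensor parallelism.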

ROCm is the other half of the story. AMD’s open-source software stack is catching up fast. The AITER inference library now supports FP4, FP8, and other low-precision formats. ROCm integrates with vLLM, Hugging Face ONNX, SGLang, and Modular Max. The software gap is narrowing. CUDA still holds the ecosystem, but ROCm is no longer a side project. It’s a direct challenge.


Oracle’s March 2025 announcement adds weight. They’re buying 30,000 MI355X units for a new AI cluster. That’s not a test run. That’s a deployment. AMD’s datacenter revenue hit $12.58B in 2024, with Instinct GPU sales breaking $5B. The MI355X is already shipping to hyperscalers and will be broadly available in Q3 through Dell, HPE, Supermicro, and others.

The MI355X runs at 1400W and requires liquid cooling. It’s not for desktops. It’s for racks. AMD’s roadmap includes the MI400X in 2026 and MI500X in 2027. The company is scaling up production and pushing toward 128-GPU racks. The architecture uses CDNA 4, built on TSMC’s 3nm node, with eight compute dies and 288 GB of HBM3e stitched together using 2.5D and 3D hybrid bonding.

The performance gap is closing. The software moat is under pressure. AMD is not just competing. It’s executing.

Sources

https://signal65.com/wp-content/uploads/2025/06/Signal65-Insights_AMD-Instinct-MI355X-Examining-Next-Generation-Enterprise-AI-Performance.pdf

https://www.xda-developers.com/amd-mi350x-mi355x-launch/

https://www.theregister.com/2025/06/12/amd_mi355x/

https://www.nextplatform.com/2025/02/04/amd-moves-up-instinct-355x-launch-as-datacenter-biz-hits-records/

https://finance.yahoo.com/news/mi355x-amds-most-powerful-ai-183400836.html
