H100 Freakonomics: The economic tiers of H100s
How size matters for H100 networking and cost, and why Elon Musk’s x.ai supercluster costs over $4 billion.
These are follow-up footnotes for the previous viral article, originally posted on latent.space: details that didn’t make it into the original article (too much math).
This is meant to be a “living post” for any future amendments as well.
I recommend reading that first for context:
TLDR: Why we approximate the CAPEX cost of an H100 at $50k per GPU
My spreadsheets for calculations can be found here
Not all H100 clusters are made equal. Broadly speaking, they are segmented into the following bands, determined by their InfiniBand network:
<= 2 nodes: direct InfiniBand interconnect (pretty much only unis)
<= 64 nodes: 1 layer of InfiniBand switches (common small clusters)
<= 2048 nodes: 2 layers of InfiniBand switches (most common size range)
<= 65,536 nodes: 3 layers of InfiniBand switches (handful, big players only)
H100 training clusters are extremely network-hungry due to the nature of the training process, and are typically deployed with 3.2 Tbps of InfiniBand interconnect per node, built from 8 x 400 Gbps connections (see the sketch below for how the tiers above fall out of the switch sizes).
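To make those tier boundaries concrete, here is a minimal back-of-the-envelope sketch of why the node counts fall where they do. It assumes a non-blocking fat-tree built from 64-port InfiniBand switches (64 x 400 Gbps ports, Quantum-2 class); that switch assumption is mine, not from the article, and real deployments vary.

```python
# Rough fat-tree sizing, assuming 64-port InfiniBand switches
# (64 x 400 Gbps ports) -- an assumption for illustration, not from the article.
SWITCH_PORTS = 64   # ports per switch (assumed)
RAILS = 8           # 8 x 400 Gbps links per H100 node
LINK_GBPS = 400

def max_nodes(layers: int, ports: int = SWITCH_PORTS) -> int:
    """Approximate max nodes per rail in a non-blocking fat-tree.

    0 layers -> 2 nodes (back-to-back direct link)
    L layers -> 2 * (ports // 2) ** L; every layer below the top splits its
                ports half down / half up, the top layer uses all ports down.
    """
    return 2 if layers == 0 else 2 * (ports // 2) ** layers

for layers in range(4):
    print(f"{layers} switch layer(s): <= {max_nodes(layers):,} nodes")
# 0 -> 2, 1 -> 64, 2 -> 2,048, 3 -> 65,536

print(f"per-node bandwidth: {RAILS * LINK_GBPS / 1000:.1f} Tbps")  # 3.2 Tbps
```

With 8 GPUs per node, the 2,048-node tier tops out around 16k GPUs, which is why a ~100k-GPU cluster (~12,500 nodes) is forced into the 3-layer tier.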
The problem: as the cluster size increases, the network needed to support the required level of interconnectivity becomes more complicated.
As a result, this has a direct impact on the unit economics.
PS: 100k H100s is the approximate size of the x.ai cluster.
Each tier of cluster node sizes (<=64, <=2048, >2048) comes with a bump up in the H100 node CAPEX economics, ranging approximately from $47k to $52k per H100, which is why I used the $50k average previously.
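As a sanity check on the headline figure, a minimal sketch of the cluster-level arithmetic, using the per-GPU range above and the ~100k GPU count from the PS; the exact per-tier figures live in the linked spreadsheet.

```python
# Cluster-level CAPEX from the per-GPU range above (sanity check only;
# exact per-tier figures are in the linked spreadsheet).
PER_GPU_LOW, PER_GPU_HIGH = 47_000, 52_000
XAI_GPUS = 100_000  # approximate x.ai cluster size

low, high = XAI_GPUS * PER_GPU_LOW, XAI_GPUS * PER_GPU_HIGH
print(f"~${low / 1e9:.1f}B to ~${high / 1e9:.1f}B")  # ~$4.7B to ~$5.2B
```

Which is where the “over $4 billion” figure in the subtitle comes from.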
Additionally, given the sheer energy scale of these GPUs, where each H100 server consumes as much energy as roughly 4 US households, the numbers projected here are conservative.
Small clusters (<=64 nodes) can possibly be retrofitted into existing data centers.
For larger clusters, the energy consumption starts to rival that of small or even medium-sized cities. Dedicated cooling and power lines, or even generators, will be needed, dramatically increasing the setup unit costs involved.
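A minimal sketch of the power math behind those two claims. The per-unit figures (≈700 W per H100 SXM GPU, 8 GPUs per server, ~1.2 kW average draw per US household, and a rough 1.5x overhead for host systems, networking and cooling) are assumptions on my part, not numbers from the article:

```python
# Back-of-the-envelope power math (all per-unit figures are assumptions).
GPU_WATTS = 700          # H100 SXM TDP, roughly
GPUS_PER_SERVER = 8
HOUSEHOLD_WATTS = 1_200  # ~10,500 kWh/year average US household
OVERHEAD = 1.5           # host CPUs, NICs, networking, cooling (rough)

gpu_kw = GPU_WATTS * GPUS_PER_SERVER / 1000
print(f"GPU draw per server: {gpu_kw:.1f} kW "
      f"(~{gpu_kw * 1000 / HOUSEHOLD_WATTS:.0f} US households)")  # ~5

cluster_mw = 100_000 * GPU_WATTS * OVERHEAD / 1e6
print(f"100k-GPU cluster, all-in: ~{cluster_mw:.0f} MW")  # ~105 MW
```

At roughly 100 MW sustained, that is on the order of what ~80,000 households draw, which is why dedicated power lines, substations, or generators stop being optional at this scale.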
Combined with the high failure rate of H100s (they have less than 99.9% uptime), Nvidia themselves project a 1:1 ratio for the cluster facility CAPEX & operating costs. So we are being very conservative by using only $50k per H100 instead.
I will not be surprised if the real number is easily 1.2x higher, especially for larger clusters, given how price-inelastic H100 servers are when buying from Nvidia.
Shoutout: related SemiAnalysis write-ups
These folks are among the best in the industry for this kind of analysis, and are worth reading if these kinds of infrastructure deep dives are your thing.
Answers the question: how much will the setup cost be (minus facility) when you do custom networking solutions instead of following the Nvidia reference design?
H100 networking and failure rates, in even more detail
Amendment: electricity cost calculation
Shout out to
and @youjiaccheng for spotting the minor math error, which has been fixed.
Featherless.AI plug …
What we do …
At Featherless.AI, we currently host the world’s largest collection of open-source AI models, instantly accessible serverlessly, with unlimited requests at a fixed price from $10 a month.
We have indexed and made over 2,000 models ready for inference today. This is 10x the catalog of openrouter.ai, the largest model provider aggregator, and is the world’s largest collection of open-weights models available serverlessly for instant inference, without the need for any expensive dedicated GPUs.
And our platform makes this possible, as it’s able to dynamically hot-swap between models in seconds.
It’s designed to be easy to use, with full OpenAI API compatibility, so you can plug our platform in as a drop-in replacement for the existing AI API behind your AI agents running in the background.
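For illustration, a minimal sketch of what “plugging it in” looks like with the standard OpenAI Python client. The base URL and model name below are placeholder assumptions for the example; check the Featherless.AI docs for the actual values.

```python
# Minimal sketch: pointing the standard OpenAI Python client at an
# OpenAI-compatible endpoint. The base URL and model name are placeholders
# assumed for illustration -- check the Featherless.AI docs for the actuals.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed endpoint
    api_key="YOUR_FEATHERLESS_API_KEY",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any hosted open-weights model
    messages=[{"role": "user", "content": "Hello from my existing agent stack!"}],
)
print(response.choices[0].message.content)
```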
And we do all of this because we believe that AI should be easily accessible to everyone, regardless of language or social status.
Additional Sources:
- ChatGPT launch date: Wikipedia
- H100 launch date: Tech Power Up Database
- The A100 SXM had 624 bf16 TFlops, the H100 SXM was 1,979 bf16 TFlops
- Microsoft & AWS allocated over $40 billion in AI infra alone: Wall Street Journal
- “600 Billion Dollars” is about: Sequoia’s AI article
- Nvidia investor slides for Oct 2014: page 14 has the pitch for “data centers”
- SemiAnalysis: deep dive into H100 clusters, with an approximate 5-year lifespan for components