H100 Freakonomics: The economic tiers of H100s
How size matters for H100 networking and cost, and why Elon Musk’s x.ai supercluster costs over $4 billion.
These are follow-up footnotes for the previous viral article, originally posted on latent.space: details that didn’t make it into the original article (too much math).
This is meant to be a “living post” for any future amendments as well.
I recommend reading that first for context:
TLDR: Why we approximate the CAPEX cost of an H100 at $50k per GPU
My spreadsheets for calculations can be found here
Not all H100 clusters are made equal. Broadly speaking, they are segmented into the following bands, determined by their InfiniBand network:
<= 2 nodes: direct InfiniBand interconnect (pretty much only unis)
<= 64 nodes: 1 layer of InfiniBand switches (common small clusters)
<= 2048 nodes: 2 layers of InfiniBand switches (most common size range)
<= 65,536 nodes: 3 layers of InfiniBand switches (handful, big players only)
H100 training clusters are extremely network-hungry due to the nature of the training process, and are typically deployed with 3.2 Tbps of InfiniBand interconnect per node, built from 8 x 400 Gbps connections (see the sketch below for how the tiers above fall out of the switch sizes).
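To make those tier boundaries concrete, here is a minimal back-of-the-envelope sketch of why the node counts fall where they do. It assumes a non-blocking fat-tree built from 64-port InfiniBand switches (64 x 400 Gbps ports, Quantum-2 class); that switch assumption is mine, not from the article, and real deployments vary.

```python
# Rough fat-tree sizing, assuming 64-port InfiniBand switches
# (64 x 400 Gbps ports) -- an assumption for illustration, not from the article.
SWITCH_PORTS = 64   # ports per switch (assumed)
RAILS = 8           # 8 x 400 Gbps links per H100 node
LINK_GBPS = 400

def max_nodes(layers: int, ports: int = SWITCH_PORTS) -> int:
    """Approximate max nodes per rail in a non-blocking fat-tree.

    0 layers -> 2 nodes (back-to-back direct link)
    L layers -> 2 * (ports // 2) ** L; every layer below the top splits its
                ports half down / half up, the top layer uses all ports down.
    """
    return 2 if layers == 0 else 2 * (ports // 2) ** layers

for layers in range(4):
    print(f"{layers} switch layer(s): <= {max_nodes(layers):,} nodes")
# 0 -> 2, 1 -> 64, 2 -> 2,048, 3 -> 65,536

print(f"per-node bandwidth: {RAILS * LINK_GBPS / 1000:.1f} Tbps")  # 3.2 Tbps
```

With 8 GPUs per node, the 2,048-node tier tops out around 16k GPUs, which is why a ~100k-GPU cluster (~12,500 nodes) is forced into the 3-layer tier.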
The problem: as the cluster size increases, the network needed to support the required level of interconnectivity becomes more complicated.
As a result, this has a direct impact on the unit economics.
PS: 100k H100s is the approximate size of the x.ai cluster.
Each tier of cluster node sizes (<=64, <=2048, >2048) comes with a bump up in the H100 node CAPEX economics, ranging approximately from $47k to $52k per H100, which is why I used the $50k average previously.
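As a sanity check on the headline figure, a minimal sketch of the cluster-level arithmetic, using the per-GPU range above and the ~100k GPU count from the PS; the exact per-tier figures live in the linked spreadsheet.

```python
# Cluster-level CAPEX from the per-GPU range above (sanity check only;
# exact per-tier figures are in the linked spreadsheet).
PER_GPU_LOW, PER_GPU_HIGH = 47_000, 52_000
XAI_GPUS = 100_000  # approximate x.ai cluster size

low, high = XAI_GPUS * PER_GPU_LOW, XAI_GPUS * PER_GPU_HIGH
print(f"~${low / 1e9:.1f}B to ~${high / 1e9:.1f}B")  # ~$4.7B to ~$5.2B
```

Which is where the “over $4 billion” figure in the subtitle comes from.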
Additionally, given the sheer energy scale of these GPUs, where each H100 server consumes as much energy as roughly 4 US households, the numbers projected here are conservative.
Small clusters (<=64 nodes) can possibly be retrofitted into existing data centers.
For larger clusters, the energy consumption starts to rival that of small or even medium-sized cities. Dedicated cooling and power lines, or even generators, will be needed, dramatically increasing the setup unit costs involved.
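A minimal sketch of the power math behind those two claims. The per-unit figures (≈700 W per H100 SXM GPU, 8 GPUs per server, ~1.2 kW average draw per US household, and a rough 1.5x overhead for host systems, networking and cooling) are assumptions on my part, not numbers from the article:

```python
# Back-of-the-envelope power math (all per-unit figures are assumptions).
GPU_WATTS = 700          # H100 SXM TDP, roughly
GPUS_PER_SERVER = 8
HOUSEHOLD_WATTS = 1_200  # ~10,500 kWh/year average US household
OVERHEAD = 1.5           # host CPUs, NICs, networking, cooling (rough)

gpu_kw = GPU_WATTS * GPUS_PER_SERVER / 1000
print(f"GPU draw per server: {gpu_kw:.1f} kW "
      f"(~{gpu_kw * 1000 / HOUSEHOLD_WATTS:.0f} US households)")  # ~5

cluster_mw = 100_000 * GPU_WATTS * OVERHEAD / 1e6
print(f"100k-GPU cluster, all-in: ~{cluster_mw:.0f} MW")  # ~105 MW
```

At roughly 100 MW sustained, that is on the order of what ~80,000 households draw, which is why dedicated power lines, substations, or generators stop being optional at this scale.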
Combined with the high failure rate of H100s (they have less than 99.9% uptime), Nvidia themselves project a 1:1 ratio for the cluster facility CAPEX & operating costs. So we are being very conservative by using only $50k per H100 instead.
I will not be surprised if the real number is easily 1.2x higher, especially for larger clusters, given how price-inelastic H100 servers are when buying from Nvidia.
Shoutout: related SemiAnalysis write-ups
These folks are among the best in the industry for this kind of analysis, and are worth reading if these kinds of infrastructure deep dives are your thing.
Answers the question: how much will the setup cost be (minus facility) when you do custom networking solutions instead of following the Nvidia reference design?
H100 networking and failure rates, in even more detail
Amendment: electricity cost calculation
Shout out to
and @youjiaccheng for spotting the minor math error, which has been fixed.
Featherless.AI plug …
What we do …
At Featherless.AI, we currently host the world’s largest collection of open-source AI models, instantly accessible serverlessly, with unlimited requests at a fixed price from $10 a month.
We have indexed and made over 2,000 models ready for inference today. This is 10x the catalog of openrouter.ai, the largest model provider aggregator, and is the world’s largest collection of open-weights models available serverlessly for instant inference, without the need for any expensive dedicated GPUs.
And our platform makes this possible, as it’s able to dynamically hot-swap between models in seconds.
It’s designed to be easy to use, with full OpenAI API compatibility, so you can plug our platform in as a drop-in replacement for the existing AI API behind your AI agents running in the background.
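For illustration, a minimal sketch of what “plugging it in” looks like with the standard OpenAI Python client. The base URL and model name below are placeholder assumptions for the example; check the Featherless.AI docs for the actual values.

```python
# Minimal sketch: pointing the standard OpenAI Python client at an
# OpenAI-compatible endpoint. The base URL and model name are placeholders
# assumed for illustration -- check the Featherless.AI docs for the actuals.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed endpoint
    api_key="YOUR_FEATHERLESS_API_KEY",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any hosted open-weights model
    messages=[{"role": "user", "content": "Hello from my existing agent stack!"}],
)
print(response.choices[0].message.content)
```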
And we do all of this because we believe that AI should be easily accessible to everyone, regardless of language or social status.
Additional Sources:
- ChatGPT launch date: Wikipedia
- H100 launch date: Tech Power Up Database
- The A100 SXM had 624 bf16 TFlops, the H100 SXM was 1,979 bf16 TFlops
- Microsoft & AWS allocated over $40 billion in AI infra alone: Wall Street Journal
- “600 Billion Dollars” is about: Sequoia’s AI article
- Nvidia investor slides for Oct 2014: page 14 has the pitch for “data centers”
- SemiAnalysis: deep dive into H100 clusters, with an approximate 5-year lifespan for components