September 2, 2017•blog
When designing a GPU cluster, it is important to keep an upgrade path in mind. This post suggests a possible sequence of designs that lets you grow from a four-GPU cluster to a 100-GPU cluster while keeping the upgrade path clear.
The basic principle of this design is to keep track of the next limiting factor (power, cooling, space, or storage bandwidth) and to adjust the design as we hit each limit.
Here’s the design sequence, presented by scale. Depending on the needs of your department, you may need to tweak this progression.
Before we start, keep these points in mind:
We start with our very first pair of machines. Here’s how we arrange them:
We split the responsibilities into two machines:
Users log in to the bastion server, which forwards them to the compute server. The chief advantage of this arrangement is that all compute machines appear uniform to users. This also makes them ephemeral, which allows us (once we scale the system) to take them down and replace them with minimal disruption to service.
When designing the bastion, keep these in mind:
GPU servers:
Now we start scaling up! At this point, we are expanding from one compute machine to four machines.
At this scale, we add a switch to connect the bastion and the compute hosts. A 24-port gigabit Ethernet switch is still sufficient at this scale; a large NFS cache on the client machines will, once the working set is cached, provide sufficient read performance.
We also add an uninterruptible power supply for the bastion host. A 250VA system will allow the bastion host to remain online through temporary power brownouts or circuit resets. We do not provide battery backups to the compute servers because UPSes capable of delivering the required wattage tend to be very expensive, and the bastion can bring them back online with Wake-On-LAN packets.
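To make the Wake-On-LAN step concrete, here is a minimal sketch of what the bastion could run after a power event. A magic packet is six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The function names, the broadcast address, and the choice of port 9 are assumptions for illustration; Wake-On-LAN must also be enabled in each compute server's firmware and network card settings.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-On-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local network (port 9 is an assumption)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling `wake("00:11:22:33:44:55")` (a placeholder MAC address) for each compute server would then bring the machines back up once power returns.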
This is the regime under which space, cooling, and power are limiting factors. This is also the point where co-locating servers in a datacenter makes sense.
Cooling is the easiest to analyze: a standard 1440W GPU server, under maximum load, requires 5000 BTU/hr of air-conditioning power. You should consult your facilities department for this; a general rule of thumb is that an office building can dissipate at most 20000 BTU/hr. Moving to a datacenter is the cheapest fix for this.
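The cooling figure follows from the conversion 1 W ≈ 3.412 BTU/hr. Here is a small sketch of that arithmetic, using the 1440W server and the 20000 BTU/hr rule of thumb from above; the function names are my own, and your building's real cooling budget may differ.

```python
import math

BTU_PER_HR_PER_WATT = 3.412  # 1 W of heat is roughly 3.412 BTU/hr


def heat_load_btu_per_hr(watts: float) -> float:
    """Heat a machine dumps into the room when drawing `watts`."""
    return watts * BTU_PER_HR_PER_WATT


def max_servers_for_cooling(cooling_btu_per_hr: float, server_watts: float) -> int:
    """Number of servers at full load that a given cooling budget can absorb."""
    return math.floor(cooling_btu_per_hr / heat_load_btu_per_hr(server_watts))


print(round(heat_load_btu_per_hr(1440)))     # about 4913 BTU/hr per server
print(max_servers_for_cooling(20000, 1440))  # 4 servers under the rule of thumb
```

In other words, a four-machine cluster already uses up the whole rule-of-thumb budget for an office.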
A general rule of thumb is that a single 110V/20A circuit supports two servers; now that we have four to eight machines, we are likely exceeding the number of available power circuits. Adding power circuitry is expensive and depends on the particular building; once again, moving to a datacenter will fix this problem.
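As a quick sketch of the circuit count, taking the two-servers-per-circuit rule of thumb at face value (the function name is my own, and your electrician may apply a stricter limit):

```python
import math


def circuits_needed(servers: int, servers_per_circuit: int = 2) -> int:
    """Dedicated circuits required under the two-servers-per-circuit rule."""
    if servers < 0 or servers_per_circuit < 1:
        raise ValueError("servers must be >= 0 and servers_per_circuit >= 1")
    return math.ceil(servers / servers_per_circuit)


for n in (4, 8):
    print(n, "servers ->", circuits_needed(n), "circuits")
```

Four to eight machines therefore need two to four dedicated circuits, which is more than most offices can spare.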
Rackmounting the devices allows for much easier wire management, better cooling (through consistent airflow), and much more space-efficient housing. It is also mandatory if you want to co-locate your machines in a datacenter. Here’s the cost breakdown:
Assuming that your datacenter imposes a 42U (standard height unit) limit on the height of your rack, each computer is 4U, and the switch and UPS are each 2U, you can fit a total of 8 compute machines, the bastion, UPS, and switch on a single rack. Expect to pay about $1k for a high-quality 42U extended-depth (for GPU servers) server rack.
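The rack budget can be checked with a few lines of arithmetic. This sketch assumes the bastion is also a 4U machine (as "each computer is 4U" suggests); the function name and parameters are my own.

```python
def max_compute_machines(rack_u: int = 42, machine_u: int = 4,
                         bastion_u: int = 4, switch_u: int = 2,
                         ups_u: int = 2) -> int:
    """Compute machines that fit after reserving space for bastion, switch, and UPS."""
    free_u = rack_u - bastion_u - switch_u - ups_u
    if free_u < 0:
        raise ValueError("fixed equipment does not fit in the rack")
    return free_u // machine_u


machines = max_compute_machines()
used_u = machines * 4 + 4 + 2 + 2
print(machines, "compute machines,", used_u, "U used of 42U")
```

With these sizes the first rack holds 8 compute machines and uses 40U, leaving 2U spare.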
Also assuming that your GPU servers draw a maximum of 1440W each, a 21kW three-phase to one-phase converter power distribution unit (PDU) will let you power 14 GPU servers at full load (21,000W / 1,440W ≈ 14.6) for a cost of about $1.3k. By locating racks in adjacent spaces, you can amortize the cost of two PDUs over three racks.
Datacenters charge a colocation fee for servers. When moving your servers to a university-run datacenter, expect to spend about $500/rack/month; this price should include power, cooling, security, environmental monitoring, and (occasional) minor service events. Commercial datacenters charge substantially more, especially for extended-depth racks and service events.
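To get a feel for the budget, here is a rough sketch that combines the figures quoted above ($1k per rack, $1.3k per PDU, $500/rack/month). The function name is my own, the prices are the approximate ones from this post, and servers, switches, and cabling are not included.

```python
def colocation_cost(racks: int, pdus: int, months: int,
                    rack_price: int = 1000, pdu_price: int = 1300,
                    monthly_fee_per_rack: int = 500) -> tuple[int, int]:
    """Return (one-time hardware cost, recurring colocation fees) in dollars."""
    one_time = racks * rack_price + pdus * pdu_price
    recurring = racks * monthly_fee_per_rack * months
    return one_time, recurring


# Three adjacent racks sharing two PDUs, for one year.
one_time, recurring = colocation_cost(racks=3, pdus=2, months=12)
print(f"one-time: ${one_time}, first-year fees: ${recurring}")
```

For three racks that comes to $5,600 up front and $18,000 in fees for the first year, so the recurring fee dominates quickly.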
At this point, you should also start running offsite backups. Use a cheap NAS (e.g. a Synology) and run incremental backups nightly. An article on configuring this is forthcoming. Purchase enough storage to hold up to 4x the primary storage of the cluster, and house this machine offsite, preferably in a different ZIP code. Test backups often and report backup exceptions.
This is the regime where you move from one rack to three. Remember to amortize the PDU over multiple racks, and consider buying (relatively) cheap 12-port switches to simplify network cabling between racks.
In this range, it is likely that the network throughput is the limiting factor. To increase the NAS throughput:
You should use past performance data to determine what the bottlenecks in your operation are before making the decision.
At this point, we are outside my range of expertise. From my research, the key changes in this step should be:
If you have write-heavy workloads (on a GPU cluster?), or are reading from a dataset much larger than you can reasonably cache on your compute servers, you may want to add a data backbone network that uses a faster interconnect. User SSH connections and all other traffic continue to flow through the gigabit network; this secondary network will only be used for data.
A common trick (for the fledgling cluster) is to purchase InfiniBand hardware that is one or two generations old from eBay or resellers. (In 2019, with 100Gb/s InfiniBand available, it is possible to buy 10 and 40Gb/s InfiniBand PCIe cards and switches for pennies on the dollar.) It is worth getting a professional systems administrator to help select parts and configure this option, if chosen.
I suggest tracking performance and utilization before taking this expensive step; 10GbE might be a sufficient compromise.
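A back-of-the-envelope comparison can help frame that decision. This sketch computes the ideal time to stream a dataset over links of different speeds; the 500 GB dataset size is a made-up example, and real throughput will be lower because of protocol overhead and disk limits on the NAS.

```python
def transfer_seconds(dataset_gb: float, link_gbps: float) -> float:
    """Ideal time to move `dataset_gb` gigabytes over a `link_gbps` Gb/s link."""
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    return dataset_gb * 8 / link_gbps  # 8 bits per byte


DATASET_GB = 500  # hypothetical working set, in decimal gigabytes

for name, gbps in [("1GbE", 1), ("10GbE", 10), ("40Gb/s InfiniBand", 40)]:
    minutes = transfer_seconds(DATASET_GB, gbps) / 60
    print(f"{name}: {minutes:.1f} minutes")
```

If your measured utilization shows that jobs spend far less time than this waiting on data, the faster interconnect is unlikely to pay for itself.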