This part might just be me cargo-culting, but I feel like every research operation needs a compute cluster. Any self-respecting quant should be able to expand their computational needs to fill an arbitrarily large number of servers. The cluster I’ve just built is a low-budget clunker, made of a motley bunch of leftover and refurbished servers, linked together with parts off eBay. But I’m very proud of it!
Maximum compute per dollar
New servers cost a fortune, and only a small fraction of the cost is for the actual CPU. These servers are designed for use cases that cannot tolerate downtime, where the administrators are remote, and where all the hardware and even software must be supported by some company with expensive contracts. In contrast, my cluster is designed only for research. Downtime is ok, as long as no data gets lost and I can get back up and running easily. So, my focus is only on maximizing performance given a limited budget. I’m optimizing for compute per dollar. (Incidentally, I’ve found the PassMark CPU mark to accurately reflect how well each CPU can handle my workload.)
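To make the compute-per-dollar comparison concrete, here is a sketch of how candidate servers can be ranked. All model names, prices, and PassMark scores below are made up for illustration, not real quotes or benchmark results:

```shell
# columns: model, price ($), PassMark CPU mark (all values hypothetical)
awk '{ printf "%-10s %6.1f marks per dollar\n", $1, $3 / $2 }' <<'EOF'
serverA 2500 37000
serverB 4500 45000
serverC 850 17000
EOF
```

Sorting that output makes the winner obvious: the cheap refurbished box often beats the shiny new one by a wide margin on this metric.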
There is a stigma around buying refurbished enterprise grade equipment that I don’t understand. Basic compute servers that cost $25k three years ago now cost only $2.5k, refurbished at places like metservers.com or stalliontek.com. Both of these companies provide warranties, too. Even better, these are real servers that already exist and can be shipped to you immediately, rather than waiting months for new ones due to things like worldwide memory shortages. New Mellanox 100GbE infiniband cards cost $795 each, but on eBay 56GbE cards can be bought for $40 each.
NVMe vs. memory
Memory can really drive up the cost of a server, doubling or tripling the price. I don’t think loading up on RAM is cost-effective at scale. Instead, I recommend NVMe drives as an affordable alternative. Typical RAM for a refurbished Dell R630 server would be DDR4-2133, which has a bandwidth of 136Gbps per channel. The Samsung 970 EVO Plus 2TB NVMe drive has a read speed of 28Gbps. With the right software, an old infiniband card can max out its 56Gbps bandwidth by reading simultaneously from NVMe drives on only 2-3 other boxes in the cluster. For my workload, this is close enough to RAM speed that I/O ceases to be a bottleneck, and I can focus on just getting the computations done.
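The back-of-envelope arithmetic behind those numbers is just a unit conversion: drives are spec’d in MB/s, networks in Gbps. A quick check of the figures above:

```shell
# Convert the spec-sheet numbers above into comparable Gbps figures.
awk 'BEGIN {
  ram_mbps  = 2133 * 8   # DDR4-2133: 2133 MT/s x 8 bytes/transfer = 17064 MB/s per channel
  nvme_mbps = 3500       # 970 EVO Plus sequential read, MB/s
  printf "DDR4-2133:    %.1f Gbps per channel\n", ram_mbps * 8 / 1000
  printf "970 EVO Plus: %.1f Gbps\n", nvme_mbps * 8 / 1000
  printf "drives to fill a 56 Gbps link: %.1f\n", 56 / (nvme_mbps * 8 / 1000)
}'
```

So two drives on remote boxes can, in principle, saturate one FDR link, which is why reading from 2-3 peers is enough.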
I have also chosen to go with retail NVMe drives. They cost far less than enterprise NVMe drives, and they have the same speed (PCIe Gen 3.0 x4) as all but the very newest enterprise drives. The advantage of enterprise drives is their longer endurance, measured in hundreds or thousands of terabytes written (TBW). But I tend to read far more than I write. Another advantage is that some enterprise drives are dual-port. This is a high-availability feature that allows two hosts to access the same drive, keeping it connected in case of host failure. But as I’ve said, I don’t need expensive high-availability features.
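If you follow the retail-drive route, it’s worth tracking wear. `smartctl -a /dev/nvme0` (from smartmontools) reports a "Data Units Written" counter; per the NVMe spec, one data unit is 1000 × 512 bytes. A hypothetical reading can be converted to terabytes written like this (the counter value here is made up for illustration):

```shell
# One NVMe data unit = 1000 x 512 bytes = 512,000 bytes.
# 250,000,000 units is a hypothetical smartctl reading, not a real drive's.
awk 'BEGIN { units = 250000000; printf "%.1f TB written\n", units * 512000 / 1e12 }'
```

Compare the result against the drive’s rated TBW to see how much life is left.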
Having a distributed filesystem simplifies coding on a cluster. It makes it feel almost like just working on one big machine. Each job reads and writes to a shared filesystem that is mounted locally, using traditional POSIX system calls.
After searching out filesystem reviews, I decided to use MooseFS for my robust storage. It is easy to configure. It can handle my collection of drives of all sizes, and is robust to the failure of a drive, or even an entire server. It also has a nice browser-based monitoring tool. I have set it up to store one copy of each data chunk on an SSD, and the replicated chunk on a regular spinning disk. The clients are configured to prefer SSD chunkservers, which makes reading reasonably fast. Note: chunkserver labels apply to the whole server, so don’t mix SSDs and HDDs in one server if you want to explicitly prioritize reading from SSDs.
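For reference, here is a sketch of the label setup described above. The label letters and storage-class name are my own choices, and the `mfsscadmin` syntax should be double-checked against its man page for your MooseFS version:

```shell
# On each chunkserver, tag its storage type in /etc/mfs/mfschunkserver.cfg.
# Labels apply to the whole server, so each box gets exactly one:
#   LABELS = S    # on the SSD boxes
#   LABELS = H    # on the HDD boxes

# On a client: mount, then define a storage class that keeps one copy on
# an S-labeled server and one on an H-labeled server, and apply it.
mfsmount /mnt/mfs -H mfsmaster
mfsscadmin /mnt/mfs create -K S,H ssd_plus_hdd
mfssetsclass ssd_plus_hdd /mnt/mfs/data
```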
I considered paying for MooseFS Pro, but decided it was too expensive. For a 20TB lifetime license for versions 3.x and 4.x, I was quoted $1,620, or $810 for non-commercial use. The two main benefits of a Pro license are 1) high availability from multiple online master servers instead of just metaloggers, and 2) erasure coding for more efficient use of storage space. The erasure coding is interesting to me, but for slow storage, big disks are really cheap. So, storing multiple full copies of a file isn’t such a big deal.
For serious speed, I’ve chosen BeeGFS with NVMe drives. BeeGFS supports RDMA (remote direct memory access) over infiniband, so it can move data between boxes without involving the CPUs. It is very fast. It is also relatively easy to configure. I am treating this sort of like volatile storage, and I have not set up “buddy mirrors.” Since I will lose data if my hardware fails, I frequently rsync with the robust storage. I was disappointed to find out that even Pro BeeGFS doesn’t support erasure coding; it would make more sense with these expensive NVMe drives than with cheap spinning disks. However, erasure coding also slows down both reading and writing. So, I’m ok with giving up robustness in order to have one blazing fast filesystem.
Benchmarking a distributed filesystem is complicated and workload-dependent. But everything is working as I hoped. My cluster is mostly hyperconverged, with CPU and storage combined in each server. However, I do have some servers that are clients/CPU only. They are less powerful, so I keep them powered off until needed, to conserve energy. I got an APC AP7911A rack-mount PDU on eBay, so controlling power to the different ports is easy.
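APC PDUs like this one can be controlled over SNMP, so powering the spare boxes up and down is scriptable. Below is a sketch using net-snmp’s `snmpset`; the PDU address and the write community `private` are assumptions for illustration, and the OID is `sPDUOutletCtl` from APC’s PowerNet-MIB (1=on, 2=off, 3=reboot):

```shell
# Sketch: switch a PDU outlet over SNMP. PDU_HOST and the "private"
# community are placeholder assumptions; check your PDU's SNMP settings.
PDU_HOST=10.10.10.50
outlet_cmd() {
    outlet=$1; action=$2   # action: 1=on, 2=off, 3=reboot
    # sPDUOutletCtl OID from APC's PowerNet-MIB, indexed by outlet number
    echo snmpset -v1 -c private "$PDU_HOST" \
        ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.$outlet" i "$action"
}

# print the command that would power on outlet 4; pipe to sh to run it
outlet_cmd 4 1
```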
Building a cluster has been a lot of fun, and previously slow processes are now excitingly fast. But I’m anxious to begin real research now, and stop messing around with infrastructure.
Appendix A: configuring infiniband on CentOS 7
As a non-HPC guy, learning about infiniband and getting the network functioning was the hardest part of building the cluster. It took me a long time, and lots of reading and trial-and-error. For that reason, I think it’s worth posting detailed instructions on what eventually worked for me. I don’t believe I’m getting all that is possible out of my infiniband network, but I’m still very pleased with it.
I went with 079DJ3 Mellanox ConnectX-3 CX353A FDR InfiniBand + 56GbE/40GbE single-QSFP+ RDMA cards. The most recent ones I ordered on eBay were $35 each. I believe the PCIe lanes cannot handle the full bandwidth of the dual-port cards, which is why I stayed with the simpler single-port card and setup. I did have to order replacement brackets for a couple of my computers that take high-profile PCIe cards.
I went with Mellanox MC2207130-0A1 1.5M IB FDR QSFP copper cables for about $20 each. Fiber optic cables are better for long distances, but these passive cables have worked perfectly.
There are two switches that will work. The first is a small unmanaged switch, the Mellanox SX6005. It runs about $90 used.
The second is the larger, managed, Mellanox SX6036. It runs about $300 used.
If you have more than one switch involved, you can daisy chain them together. You can even run multiple cables between them, which will reduce the bottleneck between the switches. No special configuration is necessary: just plug in multiple cables, and the switches will spread the load among them to some degree.
There needs to be exactly one subnet manager for the infiniband network. The managed switch can provide this service, but you need to enable it in the configuration interface. The unmanaged switch cannot provide this service. In that case, you need to run a subnet manager on one server. It is trivial to install on CentOS 7:
yum install opensm
systemctl enable opensm
systemctl start opensm
I’m using CentOS 7 on my cluster, because at this time neither MooseFS nor BeeGFS supports CentOS 8. When I first played around with Ubuntu, it was much more difficult to get infiniband working. And I also had to downgrade its kernel to get BeeGFS working. I don’t think it is worth all that hassle, and CentOS 7 is working great.
# install packages
yum groupinstall "Infiniband Support"
yum install net-tools mstflint infiniband-diags iperf

# disable firewall
systemctl status firewalld
systemctl stop firewalld
systemctl disable firewalld

# disable SELinux: set SELINUX=disabled in the config file, then reboot
nano /etc/selinux/config

# start the RDMA service
systemctl start rdma
systemctl enable rdma
Updating the card firmware
After installing the infiniband card, find out the PCI address:
# Check the device's PCI address
lspci | grep Mellanox
# 04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
# so "04:00.0" is the address
Next, use the PCI address to find the card’s PSID, and note the current firmware version:
# Identify the adapter card's PSID (last line of the output)
mstflint -d 04:00.0 q
#Image type:      FS2
#FW Version:      2.32.5100
#FW Release Date: 3.9.2014
#Rom Info:        type=PXE version=3.4.306 proto=IB
#Device ID:       4099
#Description:     Node             Port1            Port2            Sys image
#GUIDs:           e41d2d0300b2bdc0 e41d2d0300b2bdc1 e41d2d0300b2bdc2 e41d2d0300b2bdc3
#MACs:                             e41d2db2bdc1     e41d2db2bdc2
#VSD:
#PSID:            DEL1100001019
Now use the PSID to find the latest firmware version:
# Download the firmware BIN file from the Mellanox website that matches
# your card's PSID:
#   http://www.mellanox.com/page/firmware_table_dell?mtag=oem_firmware_download
# Under "Adapters" -> "Dell EMC ConnectX-3 Firmware Download Center":
#   Version: 2.42.5000   OPN: 079DJ3   PSID: DEL1100001019
#   http://www.mellanox.com/downloads/firmware/fw-ConnectX3-rel-2_42_5000-079DJ3-FlexBoot-3.4.752.bin.zip
So, the latest firmware for the 079DJ3 card can be found at the link above. Download and unzip it. Next, burn the new firmware into the card:
# burn the firmware to the lspci device id
mstflint -d 04:00.0 -i /root/fw-ConnectX3-rel-2_42_5000-079DJ3-FlexBoot-3.4.752.bin b
# Current FW version on flash: 2.32.5100
# New FW version:              2.42.5000
#Burning FS2 FW image without signatures - OK
#Restoring signature - OK
reboot

# Query the device's parameters to validate the new configuration
mstconfig -d 04:00.0 q
IP over IB
One cool thing is that you can give your infiniband card an IP address. You can send IP packets over the infiniband protocol. It isn’t efficient to do that, but it is still crazy fast compared with the usual 1GbE ports.
# config ib0
nano /etc/sysconfig/network-scripts/ifcfg-ib0

DEVICE=ib0
TYPE=infiniband
BOOTPROTO=static
IPADDR=10.10.10.101
NETMASK=255.255.255.0
NETWORK=10.10.10.0
BROADCAST=10.10.10.255
ONBOOT=yes
NM_CONTROLLED=no

# then restart the network or the whole machine
systemctl restart network

# Validate new firmware burned successfully:
ifconfig
ethtool -i ib0    # or: ethtool ib0
#driver: ib_ipoib
#version: 1.0.0
#firmware-version: 2.42.5000
#expansion-rom-version:
#bus-info: 0000:04:00.0
#supports-statistics: yes
#supports-test: no
#supports-eeprom-access: no
#supports-register-dump: no
#supports-priv-flags: no

# Verify drivers and ConnectX-3 are up
ibstat
ibstatus
# test infiniband RDMA
# on one server (10.10.10.101), run:
rdma_server
# on another server (the client), run:
rdma_client -s 10.10.10.101
#rdma_client: start
#rdma_client: end 0

# infiniband performance test
# on the server run
ib_send_bw -d mlx4_0 -i 1 -F --report_gbits
# on the client run
ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 10.10.10.101
# #bytes  #iterations  BW peak[Gb/sec]  BW average[Gb/sec]  MsgRate[Mpps]
# 65536   1000         37.70            37.70               0.071910
# Sadly, not 56Gbps. I'll keep working on this.

# test ethernet speed with another box
# first start a listening server:
iperf -s
# then aim at the server to run the test
iperf -c 10.10.10.101
iperf -c 10.10.10.101 -P 2
#------------------------------------------------------------
#Client connecting to 10.10.10.101, TCP port 5001
#TCP window size: 314 KByte (default)
#------------------------------------------------------------
#[  5] local 10.10.10.102 port 42002 connected with 10.10.10.101 port 5001
#[  3] local 10.10.10.102 port 42000 connected with 10.10.10.101 port 5001
#[ ID] Interval       Transfer     Bandwidth
#[  5]  0.0-10.0 sec  16.4 GBytes  14.1 Gbits/sec
#[  3]  0.0-10.0 sec  16.3 GBytes  14.0 Gbits/sec
#[SUM]  0.0-10.0 sec  32.7 GBytes  28.1 Gbits/sec

# latency test
ib_read_lat
ib_write_lat 10.10.10.101

# another RDMA test
# on one server, run:
udaddy
# on the client, run:
udaddy -s 10.10.10.101
#udaddy: starting client
#udaddy: connecting
#initiating data transfers
#receiving data transfers
#data transfers complete
#test complete
#return status 0