This post is intended for hardware geeks who are interested in building high-end GPU workstations for engineering, crypto mining and password cracking. If you’re not on that wavelength you can safely ignore all that follows.
I’ve previously reported on my experience building a GPU workstation that I use in my consulting engineering business for analyzing antennas with a Finite Difference Time Domain (FDTD) code from Remcom. Even though I already have a very capable workstation with several high end nVidia Tesla and Quadro graphics cards, there is always pressure to improve performance and do ever harder problems. In particular, last year I took on a project that involved analyzing and designing antennas that are used inside the human body to kill cancerous tumors with microwave energy. Modeling these antennas has been particularly challenging and simulation run times have soared as a result. I’ve thus been researching hardware options to increase throughput and I thought I’d document them here in hopes they might help others evaluating similar choices for their GPU applications.
My current system uses an 8-core AMD FX8350 processor and 16 GB of ECC RAM on an Asus Sabertooth 990FX motherboard. The GPUs consist of one nVidia Quadro FX5800 and two nVidia Tesla C2070 GPU cards. (For those who didn’t read my earlier post, I point out that run time with X-FDTD is almost entirely dictated by the GPU so there is no advantage to using a faster CPU. The AMD 8-core is more than sufficient in this application and vastly cheaper than an equivalent Intel Xeon alternative. Also, the Remcom software only works with nVidia cards, so any arguments for switching to AMD GPUs are moot.) I have a separate RAID file server so the GPU workstation has only a single Intel SSD to run the OS and hold simulation programs and data. To date, it has functioned well with the Remcom software and has been remarkably fast and reliable. The only slight downside was that the Quadro and Tesla cards did not share the same nVidia chipset architecture. This was unavoidable since I purchased the older Quadro before the Fermi based cards came on the market. As a result of the architecture difference, and limitations of the Remcom software, I was unable to utilize all three cards concurrently on the same problem. I could, however, launch one simulation on the C2070 cards and a second smaller simulation on the Quadro, so it wasn’t a big deal until I started the cancer work and faced considerably more and bigger problems.
nVidia Fermi or Kepler Architecture?
Because of the mixed GPU architecture issue, one of the first upgrade options that I considered was simply to swap out the FX5800 and replace it with a newer Fermi based Quadro 6000 to match the architecture of the C2070s. That looked like an easy and relatively cheap way of getting an immediate 50% speed up on big problems, since three matched cards could work on the same problem instead of two. It would also allow me to increase the size of the problems I could solve by about 50% due to the added 6 GB of video RAM on the Quadro 6000.
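For anyone who wants to redo that back-of-the-envelope estimate for their own box, here is a minimal Python sketch. It assumes ideal scaling with the number of matched cards and 6 GB of memory per card; the function name and its parameters are made up for this illustration and have nothing to do with the Remcom software.

```python
def upgrade_gain(cards_before, cards_after, gb_per_card=6):
    """Ideal-case gain from adding matched GPUs.

    Assumes throughput and usable GPU memory both scale with the
    number of cards, which is an optimistic simplification.
    """
    if cards_before < 1 or cards_after < 1:
        raise ValueError("card counts must be at least 1")
    throughput_gain = cards_after / cards_before - 1.0
    memory_before = cards_before * gb_per_card
    memory_after = cards_after * gb_per_card
    return throughput_gain, memory_before, memory_after


# Two C2070s today, three matched Fermi cards after the swap.
gain, mem_before, mem_after = upgrade_gain(2, 3)
print(f"Throughput gain: {gain:.0%}")
print(f"GPU memory: {mem_before} GB -> {mem_after} GB")
```

Real gains will be somewhat lower than this, since no multi-GPU code scales perfectly.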
All this seemed like a good idea until I went to the nVidia website and was presented with the very latest GPU technology known as Kepler and embodied in such wonders as the Quadro K6000 and the Tesla K40. Wow! I could hardly believe my eyes! These cards have an astounding 2880 CUDA cores and 12 GB of RAM! On paper, nVidia’s specs make it appear that a single K6000 could replace four or maybe even five C2070s in terms of compute power and a pair would positively blow the doors off my existing setup. The investment would be steep, roughly $5k per card, but I figured if I could get a 5x speedup over my existing setup then an investment of maybe $15,000 or more would be well worth it to advance my engineering work.
Before I plunked down that kind of cash, however, I needed some proof that those claimed improvements would actually translate to a commensurate speedup on the X-FDTD software. I contacted Remcom to see if they had any data for their software running on the Keplers. As luck would have it, last year one of their engineers wrote an internal research paper looking at scaling of their algorithm on a cluster of various nVidia cards. While they didn’t have data for the latest K40, they did have data for the recent K20X and many others going back to the older Fermi based C2050/C2070 cards. The paper was chock full of data showing how well their algorithm scaled as the number of GPU cards was increased, both on a single machine and in a cluster. Note that I said number of GPU cards and NOT number of CUDA cores! The most telling plot in the whole paper showed total simulation time for the same typical size problem run on varying numbers of the various types of nVidia cards. While newer cards like the K20 and K20X were clearly faster than older cards, they were definitely not faster in proportion to the number of CUDA cores!
The Remcom benchmarks showed that moving from a C2090 to a K20 only reduced run time by about 5% on an average test problem. Surprisingly, they pointed out that the C2070 was the clear winner from a value perspective because it had the same amount of RAM and was only slightly slower than the C2090, but much less expensive. Figuring the K40 to be only a wee bit faster than the K20X, I estimated that I might see a ~30% reduction in run time with a K40 versus one of my old C2070s. That sounds good, but it is nowhere near the 5x speed up touted by nVidia! This is definitely a case where spending 5x the money doesn’t get 5x the performance.
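It is easy to mix up "percent reduction in run time" with "times faster", so here is a small Python sketch that converts one into the other. The function name is just something I made up for the illustration; the 5% and 30% figures are the ones quoted above.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional run-time reduction into a speedup factor.

    A job that takes (1 - reduction) of the old time runs
    1 / (1 - reduction) times as fast.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


cases = [
    ("C2090 -> K20 (benchmark)", 0.05),
    ("C2070 -> K40 (my estimate)", 0.30),
]
for label, reduction in cases:
    factor = speedup_from_reduction(reduction)
    print(f"{label}: {reduction:.0%} less run time = {factor:.2f}x speedup")
```

A 30% cut in run time works out to a speedup of roughly 1.43x, which is a long way from 5x. Going the other way, a true 5x speedup would mean an 80% reduction in run time.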
Now, don’t get me wrong, I’m not saying that nVidia is lying when they claim that the K6000 is 5x as fast as a C2070. I’d bet it is every bit that fast on some algorithms. Unfortunately, there must be something about the FDTD algorithm, or at least the way it’s coded by Remcom, so that it doesn’t make good use of the extra 2432 cores the K6000 has over the C2070. Alas, I had no excuse to burn cash on the latest nVidia wonder cards.
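To put a number on how little of the extra silicon shows up in the run time, here is a rough Python sketch. The core counts come straight from the figures above (2880 cores, 2432 of them extra), and the 30% figure is my own estimate rather than a measurement, so treat the result as a ballpark only. The variable names are just for illustration.

```python
K6000_CORES = 2880                        # CUDA cores on the K6000 / K40
EXTRA_CORES = 2432                        # extra cores over the C2070
C2070_CORES = K6000_CORES - EXTRA_CORES   # cores on the older Fermi card

# How many times more cores the Kepler card has.
core_ratio = K6000_CORES / C2070_CORES

# My estimated ~30% run-time reduction, expressed as a speedup factor.
estimated_speedup = 1.0 / (1.0 - 0.30)

# Fraction of the core-count advantage that shows up as real speedup.
fraction_realized = estimated_speedup / core_ratio

print(f"C2070 cores: {C2070_CORES}")
print(f"Core ratio: {core_ratio:.2f}x")
print(f"Estimated speedup: {estimated_speedup:.2f}x")
print(f"Share of core advantage realized: {fraction_realized:.0%}")
```

In other words, more than six times the cores would buy well under one and a half times the speed on this particular code, if my estimate is anywhere near right.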
Going back to the Remcom technical paper, it was clear that throughput scales almost linearly as the number of GPUs increases. This scaling was even possible with a GPU cluster through the use of MPI (Message Passing Interface) that Remcom had recently incorporated into their code. It seemed that my best option was to buy as many Fermi cards as I could reasonably afford and then work out the networking, power and cooling issues involved in a small GPU cluster. I’ll dive into some of that in the next part of this series. Until then, I think it’s clear that you have to look deeper than the marketing material offered by nVidia and AMD in order to judge whether their latest GPU marvels will really pay off in your particular application. Test data on your software is all that matters!
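If you want to play with the card-count scaling yourself, here is a minimal Python sketch. It is a toy model and not anything from the Remcom paper: the function name, the 12 hour example run and the efficiency parameter (how much of a full card's worth of speed each additional card contributes) are all my own assumptions for illustration.

```python
def estimated_runtime(single_gpu_hours, n_gpus, efficiency=1.0):
    """Estimate run time on n_gpus cards from a single-card run time.

    Toy model: the first card counts fully and every additional card
    contributes `efficiency` of a card. efficiency=1.0 is perfectly
    linear scaling; lower values model communication overhead.
    """
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    effective_cards = 1.0 + (n_gpus - 1) * efficiency
    return single_gpu_hours / effective_cards


# Hypothetical 12 hour single-card job on 1 to 6 cards.
for n in range(1, 7):
    ideal = estimated_runtime(12.0, n)
    lossy = estimated_runtime(12.0, n, efficiency=0.9)
    print(f"{n} cards: {ideal:.1f} h ideal, {lossy:.1f} h at 90% efficiency")
```

Plug in a measured single-card run time and an efficiency taken from your own benchmarks, and you get a quick feel for when adding another card stops being worth the money.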
Thanks for reading!
Update Mar 7, 2014
Here’s the link to Adventures in GPU Upgrades – Part 2