Introduction

The GPU (Graphics Processing Unit) is now the preferred computing tool for a host of applications, ranging from engineering and medical applications to oil and gas exploration, 3D games, and even brute-force password cracking. Even if you have only a modest desktop or notebook computer, you may find that you’re already equipped with a capable GPU. With the right software and/or some serious programming skills, you can take advantage of it to do certain types of calculations quicker than they can be done on even the fastest available CPUs. In the future, as GPU technology and the associated software tools improve, you will likely see GPUs take on more and more of the computing tasks normally associated with CPUs.

I’m a professional engineer and I rely heavily on GPUs to perform the trillions of floating-point calculations required in my antenna simulation and design work. In the world of Computational ElectroMagnetics (CEM), you can always use more computing power. As a result, I’ve spent considerable time over the last few months configuring a new GPU workstation to bring all the compute power I could reasonably afford to bear on some really challenging problems.

The goal of this post is to relay some of the things I’ve learned in this process in the hope that it will save some of you time and money should you try to build your own GPU workstation.

Software Requirements

If you’re looking to build a GPU workstation, your first consideration should be the requirements of the software you’re going to be using. In my case, I’m already heavily invested in the Remcom X-FDTD simulator. This program implements a Finite-Difference Time-Domain (FDTD) solution to Maxwell’s equations and is capable of providing accurate answers to very complicated electromagnetic problems. Originally developed for the X Window System, X-FDTD now runs on Windows, Mac and Linux computers. Interestingly, the company recommends Linux as the best platform for taking advantage of GPU acceleration. For that and other reasons, my operating system of choice is Ubuntu Linux.
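For readers who haven’t met FDTD before, the method marches electric and magnetic fields forward in time on a staggered grid, and each cell update depends only on its immediate neighbors – which is exactly why it maps so well onto thousands of GPU cores. Here is a minimal, illustrative 1D CUDA sketch of the idea (my own simplification, not Remcom’s code; a real solver would inject a source and apply absorbing boundary conditions):

```
// fdtd1d.cu - minimal 1D FDTD sketch (illustrative only).
// Compile with: nvcc fdtd1d.cu -o fdtd1d
#include <cstdio>
#include <cuda_runtime.h>

// The 0.5f coefficients fold the material constants and the dt/dx
// ratio into one number for a simple normalized, lossless medium.
__global__ void update_ez(float *ez, const float *hy, float ce, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n)                 // leave the boundary cell fixed
        ez[i] += ce * (hy[i] - hy[i - 1]);
}

__global__ void update_hy(float *hy, const float *ez, float ch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        hy[i] += ch * (ez[i + 1] - ez[i]);
}

int main()
{
    const int n = 1 << 20;                     // one million cells
    float *ez, *hy;
    cudaMalloc(&ez, n * sizeof(float));
    cudaMalloc(&hy, n * sizeof(float));
    cudaMemset(ez, 0, n * sizeof(float));
    cudaMemset(hy, 0, n * sizeof(float));

    const int block = 256, grid = (n + block - 1) / block;
    for (int step = 0; step < 1000; ++step) {  // leapfrog time marching
        update_hy<<<grid, block>>>(hy, ez, 0.5f, n);
        update_ez<<<grid, block>>>(ez, hy, 0.5f, n);
    }
    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(ez);
    cudaFree(hy);
    return 0;
}
```

Every thread updates one cell from its neighbors with no contention between threads, so the work is embarrassingly parallel – exactly what a GPU is built for.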

CEM software of this sort is very expensive to develop and the number of users is relatively small, so it isn’t the kind of thing you’ll find shrink-wrapped at Best Buy or on sale at Amazon. Indeed, Remcom, like most other companies in this niche market, goes to great lengths to make sure its software is not pirated. That means license management software and yearly support fees are a fact of life, as is the notion that you have to pay extra to enable certain advanced features in the software.

Case in point: I have a license for the “Pro” level X-FDTD software. The base license allows the solver to run multi-threaded on multi-core and multi-CPU architectures. It also allows the use of a single nVidia CUDA-capable GPU to accelerate certain portions of the algorithms. If you have more than one GPU in your computer, you have to buy extra license “tokens” to enable the software to use the added GPUs. GPU tokens cost on the order of $5,000 each, so the software investment quickly exceeds any hardware costs.

At this time, X-FDTD only works with nVidia CUDA-capable cards. This means that no matter how good an AMD Radeon might be, it has no value with this software. My discussion is therefore limited to nVidia GPUs.

Consumer versus Pro Grade GPUs

You probably already know about nVidia’s GeForce consumer-grade graphics cards. They are great for gaming and home/office needs, but if you look under the hood you will find that they are not suited to engineering simulation work. If you’re going to build a machine capable of solving real-world CEM problems, you will need a card that can run full throttle for days or even weeks at a time. To handle loads like that you’re going to be looking at high-dollar Quadro and Tesla cards. Only these models offer the reliability you need to push simulations through 24/7, month after month.

CEM problems also need a lot of RAM. The Quadro and Tesla cards have between 3 and 6 GB of ECC (Error-Correcting Code) memory. GeForce cards, however, usually offer only about 1 GB, and ECC is not an option. Tesla and Quadro cards also offer substantially better double-precision throughput than comparable GeForce models. Double precision and ECC are critical in engineering but of no consequence in a 3D game. As a result, the GeForce cards are substantially cheaper.
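If you want to confirm what a given card actually offers, the CUDA runtime will report it directly. Here is a minimal query sketch using standard CUDA runtime calls (compile with nvcc):

```
// gpuinfo.cu - list each CUDA device's memory size and ECC status.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("GPU %d: %s, %.1f GB RAM, ECC %s, compute capability %d.%d\n",
               d, p.name, p.totalGlobalMem / 1073741824.0,
               p.ECCEnabled ? "on" : "off", p.major, p.minor);
    }
    return 0;
}
```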

Just to clarify, the Quadro and Tesla cards generally share the same architectures, but the Quadro cards are enhanced with additional rendering capabilities and video output options. The Quadro cards are aimed at visualization for CAD and imaging, while Tesla cards are aimed at pure number crunching. Depending on the model, Tesla cards have minimal or no graphics outputs and are thus less expensive than their Quadro cousins.

The first GPU that I purchased for simulation work was the Quadro FX 5800, which has 240 CUDA cores and 4 GB of GDDR3 RAM. It generally gives me a 10x to 50x speed-up on X-FDTD calculations as compared to a dual quad-core Xeon machine running at 2.5 GHz. That is a tremendous improvement, and it makes the added expense of GPUs and license tokens well worth it.

To be able to take on larger problems, I later purchased a Tesla C2070. These cards have 448 CUDA cores, 6 GB of GDDR5 RAM and a single DVI output port. Due to limitations of my existing workstation and licensing, I could use either the Quadro or the Tesla, but not both at the same time in the same machine. The possibility of combining them to tackle even bigger problems led me to investigate building a new GPU workstation and purchasing more license tokens and GPUs.

CPU Requirements

One of the first things most people think about when buying or building a new computer is the CPU. In the case of a GPU workstation, however, I have found that the CPU is a secondary consideration since the majority of the calculations are done on the GPU. I tested this assertion by running X-FDTD benchmarks with the Quadro FX 5800 card in the above-mentioned Xeon-based workstation and again in a system with a low-end 2.1 GHz AMD Phenom X4 (quad-core) CPU. Run times were nearly identical, which suggests that even the low-end CPU provided enough power to keep the Quadro GPU fully fed. Extra cores did not shorten simulation run times, but they did noticeably improve system responsiveness while simulations were running.

Given this data, I decided that money was better spent on license tokens and additional GPUs than on high-end Intel Xeon CPUs, and that the best course of action was to build a system around one of the relatively cheap AMD Phenom X6 CPUs.

Motherboard Selection Criteria

ECC Memory

There are several factors I considered important when looking for a motherboard. One of the biggest was the need for ECC memory support. While consumers and gamers rarely give ECC a thought, I feel it’s important to have some protection against the inevitable memory errors that will occur during lengthy engineering simulations. Most consumer-grade motherboards don’t offer ECC support for either Intel or AMD processors. Intel’s Core i series processors (e.g. i5 and i7) do NOT support ECC, period; if you want ECC and Intel, you must be in the Xeon camp with a server-grade motherboard. Strike 2 for Intel. AMD chips support ECC provided the motherboard has the required support chips.

nVidia Support

Since AMD acquired ATI, the number of AMD motherboard offerings that support nVidia cards has decreased substantially. My research suggests there may be a renaissance of AMD/nVidia board offerings to coincide with the release of the new FX “Bulldozer” chips from AMD. At the time I was working on this, I found several AMD-based motherboards that supported nVidia cards and provided ECC memory support.

PCIe Slot Configuration – physical size, speed and spacing

All of the nVidia Tesla and Quadro cards require what is known as a PCI Express X16 (PCIe X16) slot. The X16 denotes the physical size of the connector. Most modern motherboards offer at least one X16 slot, so a single-GPU workstation is usually no problem.

Note that there are different revisions of the PCIe slot. The Quadro and Tesla cards are backward compatible with version 1.0 PCIe slots but will work best in version 2.0 slots.

You will discover that on motherboards with more than one PCIe X16 slot, some slots run faster than others depending on what is plugged into them. How this affects total run time will depend on the software you are using. I have done some informal testing with Quadro and Tesla cards and the X-FDTD simulator, and it appears there is little difference between a slot running at X8 speed (known as 8 lanes) and one running at X16 speed (16 lanes). I believe that is because the simulator essentially pushes the problem onto the GPU and keeps it there until the iterations are done. If you are using software that needs to keep pushing data into or out of the GPU, the effect may be bigger.
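If you want to check what your own slots actually deliver, a host-to-device copy timed with CUDA events makes a quick test. A minimal sketch (the buffer size and iteration count are arbitrary choices of mine):

```
// pciebw.cu - rough host-to-device bandwidth test.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;        // 256 MB test buffer
    float *host, *dev;
    cudaMallocHost(&host, bytes);          // pinned memory for full-speed DMA
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)           // average over 10 copies
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device: %.2f GB/s\n",
           10.0 * bytes / (ms / 1000.0) / 1e9);

    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```

For reference, a PCIe 2.0 X16 slot tops out around 8 GB/s of raw bandwidth and an X8 slot around half that. If your simulation run times don’t change when a card moves between such slots, transfers aren’t your bottleneck.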

In my case, I knew I needed support for at least two GPUs, and preferably 3 or more, so I had to dig through motherboard manuals to see what configurations were supported. For example, a motherboard may have 3 slots. Slots 1 and 2 may run at up to X16 if used singly or together, but if you plug something into slot 3, the speeds of slots 2 and 3 will throttle back to, say, X8. If the motherboard has four slots and you add a fourth card, you may see the speed in all slots drop to X8. Ultimately, you will want to find a motherboard that offers the highest total throughput to the PCIe slots, subject to the other considerations I discuss here.

Besides speed, you will also need to consider the spacing and arrangement of the various slots. The Quadro and Tesla cards are double-width cards, so you will have to be careful in assessing which slots can be used with a full complement of GPUs in the system. You can easily find that a needed PCI slot is buried under a double-width GPU in an adjacent slot, or that the fourth PCIe slot is disabled once you’ve got cards in the first 3 slots. Read the manuals carefully and plan accordingly!

Power Supply Considerations

High-end GPUs suck power like crazy. The Quadro FX 5800 pulls some 185 Watts under full load; the Tesla C2070s are around 235 Watts. Throw three or four of these in a box along with a 6- or 8-core CPU, a bundle of RAM and a few disk drives, and you’re going to be looking at power supplies in the kilowatt range. I recommend using an online power supply calculator to verify that you’ve got enough capacity. When in doubt, get a bigger one!
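As a rough worked example, consider a hypothetical build with three Tesla C2070s: 3 × 235 W is about 705 W, plus roughly 125 W for a six-core CPU, plus perhaps another 100 to 150 W for the motherboard, RAM, drives and fans – call it 950 to 1,000 W at full load. Since you don’t want to run a supply flat out around the clock, that points to something in the 1,200 Watt class.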

When looking at power supplies, you will see some listed as modular supplies. This means you add cables as needed between the power supply and the various components (e.g. motherboard, GPUs, disk drives) in the system. Non-modular supplies have all or most of the cables hard-soldered to the boards inside the power supply housing. In principle, the soldered non-modular connections should be more reliable; the trouble is that the wires are all bundled, and it’s much harder to wire up the system neatly. As a result, I prefer power supplies with modular wiring. Just double-check that the wires are well seated in their connectors and you should be fine.

Case and Cooling

Given the amount of heat generated by a GPU workstation, you will want to think carefully when choosing a case. My suggestion is to err on the side of BIG! A big case gives room for big video cards and all the associated wiring while allowing for good airflow. It will also let you make full use of all available slots on your motherboard. I made a mistake the first time in choosing a moderate-sized tower case, only to discover that I couldn’t use a double-width video card in the bottom PCIe slot due to a conflict with the power supply location. I eventually got a larger case with more slots and enough space between the lower PCIe slot and the top of the power supply to let the double-width card fit. The larger case also had wheels, which I found quite handy when moving the machine with my sometimes creaky back.

Hard Drive

The slowest part of a GPU workstation is going to be the disk subsystem. I already had a nice RAID-based file server, so I did not feel the need to duplicate that functionality within the GPU workstation. Instead, I opted for a modestly sized Solid State Drive (SSD) just large enough to hold the OS, the simulator software and a modest number of good-sized simulation runs. I added a cheap one-Terabyte SATA hard drive to hold archived data locally so I don’t have to go out over the network every time I need to pull up older files. I believe this cost-effective approach, getting both size and speed, makes good sense for the majority of people.

Over the last year, I have tried several different models of SSD. I had nothing but trouble with the OCZ Vertex and Kingston models based on SandForce controllers, so I decided to try some of the SSDs from Intel. I settled on a 510 Series drive for the GPU workstation since it operates at the full 6 Gb/s SATA speed. I also purchased some 320 Series drives for my older desktop computers and found them to be very fast. So far, I believe the Intel SSDs offer the best combination of speed, reliability and compatibility.

The Build

Performance and Usage Notes

Scaling

So far the system has performed well, but I am still learning how to use it optimally. Throughput varies depending on the simulation configuration – primarily memory requirements – and the type of outputs requested. If a job fits within the RAM of one GPU, I have found it slightly more efficient to assign one GPU per simulation. I believe this is because splitting a job between GPUs inherently requires more traffic across the PCIe bus. If I have a very large job that requires two or more GPUs, however, the software manages it well, with near-linear scaling in speed.
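One trick worth knowing if your software lets you launch independent solver processes: newer CUDA drivers honor the CUDA_VISIBLE_DEVICES environment variable, so starting one job with CUDA_VISIBLE_DEVICES=0 and a second with CUDA_VISIBLE_DEVICES=1 is a simple way to pin each simulation to its own GPU. (This is a general CUDA facility, not something specific to X-FDTD, which may provide its own device-selection options.)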

Tesla Temperature Concerns

I highly recommend using temperature monitoring software or widgets to keep an eye on your GPUs, especially when you first get the system running, to make sure you don’t have an overheating situation. I was greatly concerned to see temperatures on the Tesla GPUs hover around 80C at idle and approach 90C under load. These temperatures are higher than I had experienced with the Quadro series, and they left me doing a lot of tests with various fan configurations in an attempt to lower them. Ultimately, I determined that high temperatures are normal for the Tesla cards. If you are concerned, you can manually override the fan settings (see here), but I think you’re about as well off leaving them at the default and enjoying a quieter workstation.
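If you’d rather log temperatures yourself than rely on desktop widgets, nVidia’s NVML library (the same interface the nvidia-smi tool uses, shipped with recent Linux drivers) can be polled from a small program. A minimal sketch, assuming a driver new enough to include NVML; link with -lnvidia-ml:

```
// gputemp.c - print the temperature of each nVidia GPU via NVML.
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[64];
        unsigned int temp = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU %u (%s): %u C\n", i, name, temp);
    }
    nvmlShutdown();
    return 0;
}
```

Running nvidia-smi from a terminal will print similar information without writing any code; a small program like this is handy mainly for logging temperatures over a long simulation run.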

BIOS and Primary Display Concerns

I discovered that even though a system may have 3 different video cards installed, there is no way in the BIOS to select which one is used for the primary display. The card in the first slot is always assumed to be the one connected to the monitor and is the one displayed at POST. This may be a consideration if you are using Tesla cards without a video output, or if slot speed or card spacing issues mean you want the display card somewhere other than the first slot.

Summary

Above I have outlined some of the many factors you will need to consider when building a GPU workstation. My build was aimed solely at getting good throughput with the Remcom X-FDTD EM solver; your needs may vary depending on the software you intend to use.

Alternative Route

If you already have a capable desktop computer or workstation with at least one PCIe X16 version 2.0 slot, you may want to consider adding multiple GPUs using the Cubix GPU-Expander approach. The models shown here offer a housing, backplane, power supply and connecting cables so that you can add up to four GPUs to an existing computer. I haven’t had my hands on one, but it seems like a nice option. I went the route of building a dedicated GPU workstation because all of my existing machines were old and did not have the required slot.

Update July 5, 2012

Recently, I noticed the fan on one of the Tesla GPUs was running at a substantially higher speed than the other during simulation runs. I checked into it, and it turns out that the placement of the double-width Tesla cards on the Asus motherboard leaves very little space between two of the cards, limiting airflow into the upper one. This wasn’t a problem in the winter when there was plenty of cold air available, but now that it’s 105 F in the shade, the central AC can’t keep the room temperature low enough to get away with the restriction. Fortunately, the over-sized nVidia-edition Cooler Master case had enough room to relocate the lower card away from the upper card for better airflow. The only problem was that the Asus motherboard didn’t have any more PCIe X16 slots. I did some looking and discovered a PCIe X16 extender cable from Orbit Micro that solved the problem. The Orbit cable was pricey, but it allowed me to relocate the lower card and dramatically improve airflow into the upper card. There are cheaper cables to be had, but I figured it was better to get a good quality one than risk blowing up a Tesla card. If you’re facing similar issues with crowded or overheating cards, I can definitely recommend the extender cable approach.

Update April 13, 2013

A few months back, I upgraded from the 6-core AMD Phenom II X6 1055T processor to the new 8-core AMD FX-8350. Before installing the new CPU I had to do a firmware upgrade on the motherboard, but after that it was as simple as taking out the old CPU and inserting the new one. I’m really happy with the 8-core. It runs fast and stays cool, even with the stock cooler. The extra cores definitely improve system responsiveness when the machine is heavily loaded with simulation jobs. I’ve also noticed that single-threaded calculations like CAD import and certain simulator post-processing calculations are faster, likely a result of the higher clock rate on the 8350 versus the 1055T. At less than $200, I think you’ll find it’s a very cost-effective upgrade if you need the extra cores and a bit more speed on single-threaded apps.