This is Part 2 of my series on upgrading GPU workstations. It is intended for hardware geeks who are interested in building high-end GPU workstations for engineering, crypto mining and password cracking. If you’re not on that wavelength, you can safely ignore all that follows. If you missed it, please read Part 1 before proceeding.
One step forward and two back
After digesting all the test data from Remcom, I reverted to my initial plan and purchased a new Quadro 6000. I figured this was the cheapest possible upgrade and that it would still give a substantial increase in throughput while I worked out the next steps in the upgrade process. Indeed, after installing the Quadro 6000, I saw a nearly 50% reduction in simulation run time on problems that made good use of memory across all three Fermi cards. There was almost no benefit for very small problems that fit easily within the memory of a single GPU. The latter limitation was expected and is a result of bandwidth limitations between cards. In short, for small jobs you’re better off running them one at a time on separate GPU cards. There’s no advantage to using all of your GPU cards on a pint-sized problem.
Even though throughput was up 50% after adding the new Quadro, I found that the workstation was no longer stable. I experienced numerous random freezes and X.Org crashes that were clearly related to the presence of the Quadro 6000. My first thought was that it must be a video driver issue, so I tried various nVidia driver releases provided by Ubuntu, and ultimately the latest drivers from nVidia, but none of them resolved the issue. I also went through OS upgrades from Xubuntu 12.04 to 13.10 just to be sure that I wasn’t looking at a kernel or X.Org issue. The problems persisted. Eventually, I began to think I had a bad Quadro card, since the system was completely stable with the old FX-5800 regardless of the software. Somewhat late in the process, I ran diagnostics using the nvidia-smi command and learned that the new Quadro was exhibiting ECC RAM errors. At that point, I was convinced it was a hardware problem, so I documented the errors and returned the card to the seller.
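For anyone who wants to run the same check, the ECC counters are exposed through nvidia-smi. The commands are real, but the sample report below is a hypothetical sketch (my actual logs are long gone), so the filtering step is shown against a captured snippet rather than live hardware:

```shell
# On a live system, query ECC state and error counters for all GPUs:
#   nvidia-smi -q -d ECC
# Volatile counters can later be cleared (as root) with: nvidia-smi -p 0
# To keep this snippet runnable without a GPU, the same filter is applied
# to a captured sample; the error counts are hypothetical placeholders.
sample_ecc_report() {
cat <<'EOF'
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : 0
            Double Bit
                Device Memory   : 2
EOF
}
# Print only the non-zero "Device Memory" error counts:
sample_ecc_report | awk -F': ' '/Device Memory/ && $2 + 0 > 0 {print "ECC errors:", $2}'
```

A non-zero double-bit count like the one in the sample is exactly the kind of evidence worth documenting before returning a card.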
I immediately purchased a replacement from another vendor. It arrived the following week, and within 10 minutes of installing it, the Remcom solver was again locked up with GPU ECC errors! I won’t repeat here the words that came out of my mouth at that point, but suffice it to say, I was chapped! I did more tests, and again nvidia-smi confirmed hardware issues! What are the odds of getting two bad Quadro 6000 cards in a row from different vendors?
By this time, I needed to get going with this “easy” upgrade, so I ordered yet a third Quadro 6000 and had it shipped overnight. As you can imagine, I was crossing a lot more than my fingers when I powered up the system with the third card, but it worked, and the system was completely stable. Notably, it was stable with the original Xubuntu 12.04 OS and repository driver.
With the system stable, I queued up a bunch of simulation jobs and then worked to determine what was really going on with the second Quadro card. I installed it in a different system that used the same model Asus motherboard and a 6-core AMD 1055T CPU. After I reset the ECC errors using the nvidia-smi command, I was amazed to see it run flawlessly even when pushed hard for days at a time. At one point I thought I was losing my mind, so I again tried it in the primary workstation just to double-check. Sure enough, within a few minutes it locked up the system! Something was amiss. Why would one Quadro work and a second, seemingly identical one crash the system?
A VBIOS issue?
I did some more digging and discovered that the first two cards had VBIOS version 70.00.57.00.02 with build date 04/08/11, while the third card (the one that worked) had VBIOS version 70.00.3C.00.05 with build date 08/03/10. In desperation, I actually attempted to flash the .57 series card with a .3C series VBIOS but the flash program indicated that it was an incompatible VBIOS. I didn’t press further for fear of bricking an expensive card. After further testing of the second card, I became convinced that there was in fact nothing wrong with it. I also decided that the first card (the one I had returned) was also most likely just fine. I contacted the seller and offered to buy it back after admitting my mistake. After it arrived, I retested it, and sure enough, it worked perfectly in the second workstation!
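If you want to compare VBIOS versions on your own cards, nvidia-smi reports them directly. The query is genuine; the sample below simply replays the two versions from my cards through the same grep so the snippet runs without a GPU present:

```shell
# On a live system: nvidia-smi -q | grep -i 'vbios'
# Captured sample (the two VBIOS versions from the cards discussed above;
# the rest of the report is omitted):
sample_smi_query() {
cat <<'EOF'
    Product Name            : Quadro 6000
    VBIOS Version           : 70.00.57.00.02
    Product Name            : Quadro 6000
    VBIOS Version           : 70.00.3C.00.05
EOF
}
sample_smi_query | grep -i 'vbios'
```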
By this time, I had three Quadro 6000 cards in my hand, so I tried something different. I pulled the Tesla C2070 cards and put in just the three Quadros. Presto! The system was perfectly stable! Wow! What can you say? Apparently, some Quadros don’t play well with some Teslas in the same system, even though they share the same Fermi architecture. Go figure! I’ve been unable to find any other references to this situation on the web, so if any of you nVidia gurus have insights on the problem, please enlighten me via the comment section!
A Quad GPU AMD Motherboard
Once I had a stable system, my next step was to try a quad GPU setup. The cheapest path to this configuration was to replace the Asus Sabertooth motherboard with a Gigabyte GA990FXA-UD7. This Gigabyte board is the only AMD motherboard on the market that supports up to four double-width nVidia GPUs. In a quad configuration it provides 8 PCIe lanes to each card, so bandwidth is uniformly good to all the cards. Getting more than that entails an upgrade to an Intel-based server motherboard, a Xeon processor and even more kilobucks of outlay. The motherboard swap was easy, since the AMD FX-8350 CPU and ECC RAM could be re-used in the Gigabyte board. The only downside I noted on the Gigabyte was that it had fewer fan headers than the Asus board. It was otherwise a drop-in replacement with a better PCIe slot layout. Please note that I do NOT do any over-clocking, so my evaluation will likely differ from those looking at it for gaming purposes.
With the system rebuilt around the new motherboard, I powered it up with three, and then fully four GPUs. As before, I noted a stable system when using the Tesla C2070 cards along with the Quadro 6000 having VBIOS 70.00.3C.00.05. Likewise, everything worked fine when using all of the Quadros together. Crashes were still evident when I mixed the C2070 and newer Quadro cards in the system. To get a working quad GPU system I was forced to buy yet another Tesla C2070! Ouch!
As an aside, I note that I deliberately purchased revision 1.x of the Gigabyte board, rather than the newer revision 3.0. I opted for the older version because I had read on various forums that it offered unofficial support for ECC memory. I did not know this at the time, but it is apparently rather difficult to actually confirm that ECC RAM function is working properly on any motherboard. This article suggests several avenues for confirming ECC operation. Using it as a guide, I can confirm that revision 1.x of the GA990FXA-UD7 with firmware update F10 does the following:
- The board functions fine with unbuffered ECC RAM whether ECC function is enabled or not.
- The board recognizes the presence of ECC RAM, and the BIOS functions to control it are operational. The Gigabyte options are actually more extensive than those offered by Asus, which simply offers an enable function.
- Memtest86+ revision 4.x indicates that ECC is present, but that revision apparently had issues with proper detection of ECC function, so I’m not sure it can be trusted.
- Ubuntu’s dmidecode is inconclusive: it indicates that ECC detection is present but correction is not, and it reports the memory as 64 bits wide versus the 72-bit width reported by the ECC-enabled Asus Sabertooth board.
- ecc_check.c is inconclusive.
- I have not yet tried blocking traces on my working ECC modules to force errors and test the function.
In short, I think ECC is working, but I’m not 100% confident in that assertion. If any of you have input or can confirm 100% that ECC works on these boards I’d be interested to hear from you.
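For comparison, this is roughly what dmidecode reports on a board where ECC is wired up and advertised correctly; the 72-bit total width is the giveaway (64 data bits plus 8 check bits). The command is real, but the sample output below is a sketch of a correctly-reporting board, not the Gigabyte’s actual output:

```shell
# On a live system (requires root):
#   sudo dmidecode -t memory | grep -E 'Error Correction Type|Total Width|Data Width'
# Sketch of what a board that advertises ECC correctly reports:
sample_dmidecode() {
cat <<'EOF'
	Error Correction Type: Multi-bit ECC
	Total Width: 72 bits
	Data Width: 64 bits
EOF
}
sample_dmidecode | grep -E 'Correction|Width'
```

When a board instead shows "Total Width: 64 bits" with ECC modules installed, as the Gigabyte does, the SMBIOS tables simply aren’t describing the memory controller truthfully, which is why other methods are needed.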
I use a UPS / battery backup on all my computers. I had been using an old APC Back-UPS RS 1500 with an extended battery pack on my GPU workstation. Rated at about 900 Watts, it could handle the triple GPU setup without difficulty. Adding the fourth card and starting a simulation on all of them at once immediately kicked off the overload alarms, so I had to search for something bigger. My research led me to purchase the Tripp-Lite SU1500RTXLCD2U 1500VA 1350W Smart Online rack-mount UPS. This is the largest-wattage UPS I could find that uses the standard NEMA 5-15 plugs that are common in most homes and businesses. Anything larger and the upgrade is likely going to require the services of an electrician. The run time is necessarily short (about four minutes) at full load, but it can be increased, if necessary, by purchasing additional batteries.
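A rough back-of-the-envelope power budget shows why the fourth card pushed the old unit over the edge. The per-component figures below are approximations, not measurements:

```shell
# Approximate full-load draw (all figures are rough assumptions):
#   ~225 W per Fermi card x 4, ~125 W for the FX-8350, and ~100 W for the
#   motherboard, RAM, drives and fans.
echo "$(( 4 * 225 + 125 + 100 )) W"   # well past the old UPS's ~900 W rating
```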
I like this unit. It’s very solid and pretty much plug and play. While it’s not exactly silent, the fan is not overly loud, as some of these higher-power units apparently can be. If you’re running a quad GPU machine, the UPS fan is likely the least of your auditory worries, however! I note that this is a “smart online” model, meaning that it’s constantly generating the power signal from the battery. This is different from cheaper line-interactive models that switch to and from battery according to the incoming power signal quality. Even though something like the Tripp-Lite SMART1500RMXL2UA 1500VA 1350W rack-mount UPS would likely have been fine, I opted for the online model so that I’d have the best protection for the expensive GPUs.
Cooling and 4U Rack mount case
Along with the change in motherboard, I also decided to try a new rack mountable case: The Chenbro RM41300-FS81 is a 4U server chassis designed specifically to hold and cool up to four Tesla GPUs. In fact, it’s the only rack mountable case I have been able to find which is capable of doing this! As far as I know all other cases that support quad double width GPUs are of the full tower form factor such as the Cooler Master HAF-X. While there’s nothing wrong with the tower cases, I have a lot of rack mounted test equipment in my RF/microwave lab, and I’ve come to see the advantages of going that route. There are also options for open air frames with no sides, but I prefer to have equipment shielded and enclosed if possible. For now, I am keeping that as an option only if I find cooling to be a problem.
I’ll have to admit that I was skeptical that the smaller 4U case would cool adequately, but so far it’s been just fine. Fan speeds on the Tesla cards range from about 50% to about 70% when under load, with the middle cards running harder due to restricted airflow and heat from adjacent cards. Temperature on the Fermi cards always seems to hover around 89 C when under load. Interestingly, the newer Quadro cards seem to run as much as 20 C cooler than the C2070 cards when not loaded. Fan speeds and temperatures under load still reach the same levels, however. The real test of cooling will come in July when it’s 105 F in the shade and the central air is straining to keep the ambient to 80 F. I’ll keep you posted on how that works out!
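If you want to keep an eye on the thermals yourself, newer nvidia-smi builds can emit machine-readable temperature and fan data. The query flags are real; the sample values below mirror the load figures quoted above, and the 90 C threshold is just an arbitrary example:

```shell
# On a live system:
#   nvidia-smi --query-gpu=index,temperature.gpu,fan.speed --format=csv,noheader
# Captured sample (index, temperature in C, fan %); values approximate
# my load readings:
sample_gpu_stats() {
cat <<'EOF'
0, 89, 70 %
1, 89, 68 %
2, 88, 60 %
3, 87, 52 %
EOF
}
# Flag anything at or above 90 C (threshold is an example, not a spec):
sample_gpu_stats | awk -F', ' '$2 >= 90 {print "GPU " $1 " is hot: " $2 " C"}'
```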
As far as noise is concerned, the fans in the Chenbro are definitely louder than those in the HAF-X. I wouldn’t want a Chenbro in my living room, but it’s OK for a garage lab. Beyond that, I really like the Chenbro. The quality and workmanship are first-rate. My only minor complaints would be the lack of external ports and the blindingly bright blue power LED. The former can be remedied with aftermarket panels in one of the drive bays. The latter I fixed with a snip of black electrical tape on the LED lens. I neglected to take photos when assembling my build, but you can see some good photos here if you’re interested in this exceptional case.
The work on the next part of this article is still in progress, but I’m aiming to set up a pair of multi-GPU workstations and use MPI (Message Passing Interface) to get it all working on a single problem. Until then…
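As a preview, the launch will probably look something like the sketch below: an MPI hostfile naming the two machines, plus an mpirun invocation. The hostnames, slot counts and solver binary are all placeholders at this point, since the second workstation isn’t built yet:

```shell
# Hypothetical two-workstation MPI launch (all names and counts are
# placeholders). The hostfile lists each machine and how many ranks
# it should run:
cat > hostfile <<'EOF'
workstation1 slots=4
workstation2 slots=4
EOF
# The launch itself would then be something along these lines (commented
# out here because the solver binary is a placeholder):
#   mpirun -np 8 --hostfile hostfile ./solver input.project
grep -c 'slots=' hostfile
```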
Thanks for reading!
Update on ECC Testing May 19, 2014
I had the Gigabyte GA990FXA-UD7 motherboard offline for several months. I recently put it back in service and circled back to investigate the ECC issue. I found a reference here showing another way to confirm ECC operation using the Linux EDAC subsystem. The command-line output shown below seems to confirm that the board does have ECC function enabled.
~$ dmesg | grep -i edac
[ 17.274327] EDAC MC: Ver: 2.1.0
[ 17.275730] AMD64 EDAC driver v3.4.0
[ 17.335091] EDAC amd64: DRAM ECC enabled.
[ 17.335103] EDAC amd64: F15h detected (node 0).
[ 17.335165] EDAC MC: DCT0 chip selects:
[ 17.335167] EDAC amd64: MC: 0: 2048MB 1: 2048MB
[ 17.335168] EDAC amd64: MC: 2: 2048MB 3: 2048MB
[ 17.335170] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 17.335172] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 17.335173] EDAC MC: DCT1 chip selects:
[ 17.335175] EDAC amd64: MC: 0: 2048MB 1: 2048MB
[ 17.335177] EDAC amd64: MC: 2: 2048MB 3: 2048MB
[ 17.335178] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 17.335180] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 17.335182] EDAC amd64: using x4 syndromes.
[ 17.335184] EDAC amd64: MCT channel count: 2
[ 17.335205] EDAC amd64: CS0: Unbuffered DDR3 RAM
[ 17.335205] EDAC amd64: CS1: Unbuffered DDR3 RAM
[ 17.335205] EDAC amd64: CS2: Unbuffered DDR3 RAM
[ 17.335206] EDAC amd64: CS3: Unbuffered DDR3 RAM
[ 17.335265] EDAC MC0: Giving out device to 'amd64_edac' 'F15h': DEV 0000:00:18.2
[ 17.335443] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
The command edac-util -v shows that there are as yet no ECC errors. That’s good!
~$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: ch0: 0 Corrected Errors
mc0: csrow1: ch1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: ch0: 0 Corrected Errors
mc0: csrow2: ch1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: ch0: 0 Corrected Errors
mc0: csrow3: ch1: 0 Corrected Errors
At this point, I’m satisfied that ECC is enabled and working on the board. The situation with dmidecode incorrectly reporting memory characteristics is apparently not unusual with AMD boards.