Thursday, December 14, 2017
A 17.40 beta was released on October 16, with a final release following on October 30. There have been some issues with corrupt versions of the final release, but I think they are resolved now. I encountered lots of problems with this release, which was much of my motivation for writing this post.
Until earlier this year, the AMDGPU-PRO drivers were targeted at the new Polaris cards, and support for even relatively recent Tonga was lacking. Because of this, I was using the fglrx drivers for Tonga and Pitcairn cards. The primary reason for upgrading now is for large page support, which improves performance on algorithms that use a large amount (2GB or more) of memory. With the promise of better performance, and since fglrx is no longer being maintained, I decided to upgrade.
I've been using AMDGPU-PRO with kernel 4.10.5 for my RX 470 cards, so I decided to use the same kernel. I can't say whether there are any problems with using a newer kernel like 4.10.17 or even 4.14.5, so they might work just as well. I left the on-board video enabled (i915) so I would not have to connect and disconnect video cables when testing the GPUs. After installing Ubuntu 16.04.3, I updated the kernel and rebooted. To install the AMDGPU-PRO drivers, I used the px option (amdgpu-pro-install --px), as it is supposed to support mixed iGPU/dGPU use.
My normal procedure for bringing up a multi-GPU machine is to start with a single GPU in the 16x motherboard slot, as this avoids potential issues with flaky risers. Even with just one R9 380 card in the 16x slot, I was having problems with powerplay. When it is working, pp_dpm_sclk will show the current clock rate with an asterisk, but this was not happening. After two days of troubleshooting, I concluded there is a bug with powerplay and some motherboards when using the 16x slot. When using only the 1x slots, powerplay works fine.
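The asterisk check is easy to script. Here is a minimal Python sketch, assuming pp_dpm_sclk lists one "level: clock" pair per line and marks the active level with a trailing asterisk. The sample clock values and the function name are made up for illustration.

```python
import re

# One power state per line, e.g. "1: 608Mhz *" (the asterisk marks the active state).
_STATE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*?)\s*$", re.IGNORECASE)

def active_sclk(text):
    """Return (level, mhz) for the state marked with '*', or None if none is marked."""
    for line in text.splitlines():
        m = _STATE.match(line)
        if m and m.group(3):
            return int(m.group(1)), int(m.group(2))
    return None

# Hypothetical file contents; real clock tables differ per card.
SAMPLE_OK = "0: 300Mhz\n1: 608Mhz *\n2: 910Mhz\n"
SAMPLE_BROKEN = "0: 300Mhz\n1: 608Mhz\n2: 910Mhz\n"

if __name__ == "__main__":
    print(active_sclk(SAMPLE_OK))
    print(active_sclk(SAMPLE_BROKEN))
```

On a real rig the text would come from something like /sys/class/drm/card0/device/pp_dpm_sclk, where the card index is an assumption that depends on how the GPUs enumerate. A None result corresponds to the broken powerplay behaviour described above.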
Since I wasn't able to use the 16x motherboard slot, testing card and riser combinations was more difficult. Normally when I have a problem with a card and riser, I'll move the card to the 16x slot. If the problems go away, I'll mark the riser as likely defective. Mining algorithms like ethash use little bandwidth between the CPU and GPU, so there is no performance loss to using 1x risers. Even the slowest PCIe 1.1 transfer rate is sufficient for mining. Using "lspci -vv", I could see the link speed was 5.0GT/s (LnkSta:), which is PCIe gen2 speed. Reducing the speed to gen1 would mean lower quality risers could be used without encountering errors.
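With several cards in a rig, reading the LnkSta lines by eye gets tedious. The following Python sketch pulls the negotiated speed and width out of "lspci -vv" text. The sample lines are an assumption about the output layout, which varies a little between pciutils versions, and the function name is my own.

```python
import re

# Matches e.g. "LnkSta: Speed 5GT/s, Width x1" and "Speed 2.5GT/s (downgraded), Width x1".
_LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)")

# Transfer rate per lane (GT/s) mapped to PCIe generation.
_GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3}

def link_status(lspci_text):
    """Return a list of (speed_gts, width, generation) for every LnkSta line found."""
    results = []
    for m in _LNKSTA.finditer(lspci_text):
        speed = float(m.group(1))
        results.append((speed, int(m.group(2)), _GEN_BY_SPEED.get(speed)))
    return results

# Hypothetical excerpt; LnkCap is the capability, LnkSta is the negotiated link.
SAMPLE = """\
\t\tLnkCap:\tPort #0, Speed 8GT/s, Width x16, ASPM L1
\t\tLnkSta:\tSpeed 5GT/s, Width x1, TrErr- Train- SlotClk+
"""

if __name__ == "__main__":
    for speed, width, gen in link_status(SAMPLE):
        print(f"{speed} GT/s x{width} (gen{gen})")
```

Only LnkSta is matched, since LnkCap reports what the device could do rather than what the link actually negotiated.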
My first thought was to try to set the PCIe speed in the motherboard BIOS. Setting gen1 in the chipset options made no difference, so perhaps that setting only controls the speed used during boot-up, before the OS takes over control of the PCIe bus. Next, using "modinfo amdgpu", I noticed some module options related to PCIe. Adding "amdgpu.pcie_gen2=0" had no effect. Apparently the module no longer supports that option. I could not find any documentation for the "pcie_gen_cap" option, but luckily the open-source amdgpu module supports the same module parameter. By looking at amd_pcie.h in the kernel source code, I determined "0x10001" will limit the link to gen1. I added "amdgpu.pcie_gen_cap=0x10001" to the kernel command line in /etc/default/grub, ran update-grub, and rebooted. With lspci I was able to see that all the GPUs were running at 2.5GT/s.
For clock control and monitoring, I use ROC-smi, which I've written about previously.
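To show where a value like 0x10001 comes from, here is a small Python sketch that builds the mask from per-generation bits. The bit values are the ones I recall from amd_pcie.h (ASIC support in the low 16 bits, system support in the high 16 bits), so check them against your own kernel source. Setting the bits for every generation up to the limit, rather than only the highest one, is my assumption.

```python
# Bit values as recalled from amd_pcie.h; verify against your kernel version.
ASIC_GEN = {1: 0x00000001, 2: 0x00000002, 3: 0x00000004}    # what the GPU may use
SYSTEM_GEN = {1: 0x00010000, 2: 0x00020000, 3: 0x00040000}  # what the platform may use

def pcie_gen_cap(max_gen):
    """Build a pcie_gen_cap mask allowing every generation up to max_gen (1 to 3)."""
    if max_gen not in ASIC_GEN:
        raise ValueError("max_gen must be 1, 2, or 3")
    mask = 0
    for gen in range(1, max_gen + 1):
        mask |= ASIC_GEN[gen] | SYSTEM_GEN[gen]
    return mask

if __name__ == "__main__":
    for gen in (1, 2, 3):
        print(f"gen{gen}: {pcie_gen_cap(gen):#x}")
```

For a gen1 limit this reproduces the 0x10001 value used above.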
==================== ROCm System Management Interface ====================
GPU  DID   Temp   AvgPwr    SCLK    MCLK     Fan     Perf    OverDrive  ECC
3    6938  66.0c  100.172W  858Mhz  1550Mhz  44.71%  manual  0%         N/A
1    6939  64.0c  112.21W   846Mhz  1550Mhz  42.75%  manual  0%         N/A
4    6939  62.0c  118.135W  839Mhz  1500Mhz  47.84%  manual  0%         N/A
2    6939  77.0c  123.78W   839Mhz  1550Mhz  64.71%  manual  0%         N/A
GPU : PowerPlay not enabled - Cannot get supported clocks
GPU : PowerPlay not enabled - Cannot get supported clocks
0 0402 N/A N/A N/A N/A None% N/A N/A
==================== End of ROCm SMI Log ====================
I also use Kristy's utility to set specific clock rates:
ohgodatool -i 1 --mem-state 3 --mem-clock 1550
Unfortunately ethminer-nr doesn't work with this setup. I suspect the new driver doesn't support some old OpenCL option, so the fix should be relatively simple, once I make the time to debug it.
Wednesday, December 6, 2017
Since I started mining ethereum almost two years ago, I have found that power distribution is important not just for equipment safety, but also for system stability. When I started mining, I thought my rigs would be fine as long as I used a robust server PSU to power the GPUs, with heavy 16 or 18AWG cables. After frying one motherboard and more than a couple of ATX PSUs, I've learned that a lot of careful design and testing is required.
Using Dell, IBM, or HP server power supplies for mining rigs is not a new idea, so I won't go into too much detail about them. I do recommend making an interlock connector so the server PSU turns on at the same time as the motherboard. I also recommend connecting the server PSU only to the GPU PCIe power connectors, as they are isolated from the 12V supply for the motherboard. If you try to power ribbon risers from the server PSU, the 12V from the ATX and server PSUs will be interconnected, which can lead to feedback problems. Server PSUs are very robust and unlikely to be harmed, but I have killed a cheap 450W ATX PSU this way. If you use USB risers, they are isolated from the motherboard's 12V supply, and therefore can be safely powered from the server PSU.
In the photo above, you might notice the grounding wire connecting all the cards, which then connects to a server PSU. I recently added this to the rig after measuring higher current flowing through two of the ground wires connected to the 6-pin PCIe power plugs. As I mentioned in my post about GPU PCIe power connections, there are only two ground pins, with the third ground wire being connected to the sense pin. With two ground pins and three power pins, the ground wires carry 50% more current than the 12V wires. Although the ground wires weren't heating up from the extra current, the connector was. Adding the ground bypass wire reduced the connector temperature to a reasonable level.
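The 50% figure follows directly from the pin count. As a quick check, this Python sketch splits a load evenly across the wires. The 75W load is only an illustrative figure, and it assumes all return current flows through the connector's ground pins and none through the slot.

```python
from fractions import Fraction

def per_wire_current(total_amps, power_pins=3, ground_pins=2):
    """Return (amps per 12V wire, amps per ground wire), assuming even sharing."""
    total = Fraction(total_amps)
    return total / power_pins, total / ground_pins

def ground_excess(power_pins=3, ground_pins=2):
    """Fraction by which each ground wire's current exceeds each 12V wire's current."""
    return Fraction(power_pins, ground_pins) - 1

if __name__ == "__main__":
    # Illustrative load: 75W at 12V through one 6-pin plug.
    amps = Fraction(75, 12)
    hot, gnd = per_wire_current(amps)
    print(f"total {float(amps):.2f} A: {float(hot):.2f} A per 12V wire, "
          f"{float(gnd):.2f} A per ground wire")
    print(f"ground wires carry {float(ground_excess()) * 100:.0f}% more")
```

Because connector heating scales with the square of the current, a 50% higher current means roughly 2.25 times the heat in each ground pin's contact resistance, which fits with the connector warming up before the wires did.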
For ATX PSUs, I've used a few EVGA 500B units, and do not recommend them. While even my cheap old 300W power supplies use 18AWG wire for the hard drive power connectors, the SATA and molex power cables on the 500B are only 20AWG. Powering more than one or two risers with a 20AWG cable is a recipe for trouble. I burned the 12V hard drive power wire on two 500B supplies before I realized this. I recently purchased a Rosewill 500W 80 Plus Gold PSU that was on sale at Newegg, and it is much better than the EVGA 500B. The Rosewill uses 18AWG wire in the hard drive cables, and it also has a 12V sense wire in the ATX power connector. This allows it to compensate for the voltage drop in the cable from the PSU to the motherboard. The sense wire is the thinner yellow wire in the photo below.
Speaking of voltage drop, I recommend checking the voltage at the PCIe power connector to ensure it is close to 12V. Most of my cards do not have a back plate, so I can use a multimeter to measure at the 12V pins of the power connector where they are soldered to the GPU PCB. I also recommend checking the temperature of power connectors, since good-quality low-resistance connectors are just as important as heavy-gauge wires. Warm connectors are OK, but if they are so hot that you can't hold your fingers to them, that's a problem.
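For a rough idea of what wire gauge does to voltage drop, here is a Python sketch comparing 18AWG and 20AWG. The resistance figures are standard room-temperature values for copper wire. The 0.5m cable length and 6A load are purely illustrative assumptions, and connector contact resistance is ignored.

```python
# Approximate resistance of copper wire at room temperature, in ohms per metre.
OHMS_PER_METER = {18: 0.02095, 20: 0.03331}

def cable_drop(awg, length_m, amps):
    """Return (voltage drop, watts dissipated) for a 12V wire plus its ground return.

    The factor of 2 accounts for current flowing out on the 12V wire and back
    on a ground wire of the same gauge and length.
    """
    resistance = OHMS_PER_METER[awg] * length_m * 2
    return amps * resistance, amps ** 2 * resistance

if __name__ == "__main__":
    # Illustrative case: 0.5 m cable carrying 6 A.
    for awg in (18, 20):
        volts, watts = cable_drop(awg, 0.5, 6.0)
        print(f"{awg}AWG: {volts:.3f} V drop, {watts:.2f} W lost in the cable")
```

The 20AWG wire has about 59% more resistance than 18AWG, so it drops proportionally more voltage and runs proportionally hotter at the same current.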
My last recommendation is for people in North America (and some other places) where 120V AC power is the norm. Wire up the outlets for your mining rigs for 240V instead of 120V. Power supplies are slightly more efficient at 240V, and will draw half as much current as they do at 120V. Lower current draw means less line loss going to the power supply, and therefore less heat generated in power cords and plugs. Properly designed AC power cables and plugs should never overheat below 10-15 amps; however, I have seen melted and burned connectors at barely over 10A of steady current draw.
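The benefit is easy to put numbers on. This Python sketch compares current and cord heating at the two voltages for the same load. The 1200W load and 0.1 ohm cord-plus-plug resistance are illustrative assumptions, and it ignores power factor and the small efficiency gain at 240V.

```python
def line_current(watts, volts):
    """Current drawn from the wall for a given load, ignoring power factor."""
    return watts / volts

def cord_loss(watts, volts, cord_ohms):
    """Heat dissipated in the cord and plugs, from I squared times R."""
    return line_current(watts, volts) ** 2 * cord_ohms

if __name__ == "__main__":
    load_w = 1200.0   # illustrative rig power draw
    cord_r = 0.1      # illustrative total cord and plug resistance in ohms
    for volts in (120.0, 240.0):
        amps = line_current(load_w, volts)
        loss = cord_loss(load_w, volts, cord_r)
        print(f"{volts:.0f} V: {amps:.1f} A, {loss:.1f} W lost in the cord")
```

Halving the current cuts the heat in the cord and plugs to a quarter, since the loss goes with the square of the current.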